Below I answer your two requests one by one, based on the trace: /home/ubuntu/adi/ncprof/ncompass/examples/vllm_example/.nsys_traces/large_traces/nsys_h200_vllm_128x256_TP.nsys-rep


1) Device/stream breakdown: how to focus on the relevant GPU kernel activity

A. First, the “many devices” part is real: you have 4 GPUs

Your trace contains 4 CUDA devices (deviceId 0–3) and they all do substantial work.

From the kernel table aggregated by device:

deviceId   kernel_count   total kernel time (ms)
1          241,558        25,117
3          241,558        24,953
0          241,558        24,776
2          233,102        22,774

So: this is multi-GPU, tensor-parallel-style execution (consistent with the TP in your filename), where most of the major kernels appear on all four GPUs.
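If you want to reproduce the per-device numbers yourself, a minimal sketch follows. It assumes the report has been exported to SQLite first (for example with `nsys export --type sqlite`), and that the export contains a `CUPTI_ACTIVITY_KIND_KERNEL` table with `start`/`end` timestamps in nanoseconds plus a `deviceId` column. The exact schema can differ between nsys versions, and the file name `trace.sqlite` is only a placeholder.

```python
import sqlite3


def kernel_time_by_device(conn: sqlite3.Connection):
    """Return (deviceId, kernel_count, total_ms) rows, busiest device first.

    Assumes the nsys SQLite export schema: one row per kernel launch in
    CUPTI_ACTIVITY_KIND_KERNEL, with start/end in nanoseconds.
    """
    query = """
        SELECT deviceId,
               COUNT(*)                  AS kernel_count,
               SUM("end" - start) / 1e6  AS total_ms   -- ns -> ms
        FROM CUPTI_ACTIVITY_KIND_KERNEL
        GROUP BY deviceId
        ORDER BY total_ms DESC
    """
    return conn.execute(query).fetchall()


# Usage (placeholder file name, adjust to your exported trace):
# conn = sqlite3.connect("trace.sqlite")
# for device_id, count, total_ms in kernel_time_by_device(conn):
#     print(device_id, count, round(total_ms))
```

Keeping the aggregation in SQL avoids loading hundreds of thousands of kernel rows into Python just to sum them.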

B. Streams: there are many, but the work is concentrated in just a few

If your goal is “understand what kernels are being launched on the GPUs”, you can ignore most streams and focus on the dominant ones by total kernel time.

Top (deviceId, streamId) by total kernel time:

deviceId   streamId   kernel_count   total_ms   avg_us   max_us
1          7          71,080         23,853     336      398,519
3          7          71,080         23,692     333      180,099
0          7          71,080         23,512     331      50,565
2          7          62,627         21,511     343      670,666
0          175        157,718        1,247      7.9      130
2          223        157,718        1,247      7.9      130
1          223        157,718        1,246      7.9      130
3          175        157,718        1,244      7.9      130
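The per-stream ranking above can be computed with the same kind of query, grouped by (deviceId, streamId). This is again a sketch under the same assumptions: a SQLite export of the trace, a `CUPTI_ACTIVITY_KIND_KERNEL` table, and nanosecond `start`/`end` columns. The function name and the `limit` parameter are illustrative, not part of any nsys tooling.

```python
import sqlite3


def kernel_time_by_stream(conn: sqlite3.Connection, limit: int = 8):
    """Return the top (deviceId, streamId) pairs by total kernel time.

    Each row is (deviceId, streamId, kernel_count, total_ms, avg_us, max_us).
    Assumes start/end are nanosecond timestamps, as in the nsys SQLite export.
    """
    query = """
        SELECT deviceId,
               streamId,
               COUNT(*)                  AS kernel_count,
               SUM("end" - start) / 1e6  AS total_ms,  -- ns -> ms
               AVG("end" - start) / 1e3  AS avg_us,    -- ns -> us
               MAX("end" - start) / 1e3  AS max_us
        FROM CUPTI_ACTIVITY_KIND_KERNEL
        GROUP BY deviceId, streamId
        ORDER BY total_ms DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


# Usage (placeholder file name):
# conn = sqlite3.connect("trace.sqlite")
# for row in kernel_time_by_stream(conn):
#     print(row)
```

Sorting by total time rather than kernel count is deliberate: a stream with many tiny kernels would otherwise rank above the stream that actually dominates GPU time.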

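What this means practically: on every GPU, stream 7 carries roughly 94 to 95% of that GPU's total kernel time. The second stream (175 on devices 0 and 3, 223 on devices 1 and 2) launches more than twice as many kernels, but they average only about 7.9 microseconds each and add up to roughly 1.2 s per GPU.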

C. A simple “focus plan” for kernel understanding

When you’re overwhelmed by devices/streams, do this:

  1. Pick one representative GPU first (say deviceId=0) since most heavy kernels replicate across GPUs in TP.
  2. Filter to the dominant stream (here: streamId=7) to understand the main forward pass.
  3. Only after that, look at: