Below I answer your two requests one by one, based on the trace:
/home/ubuntu/adi/ncprof/ncompass/examples/vllm_example/.nsys_traces/large_traces/nsys_h200_vllm_128x256_TP.nsys-rep
Your trace contains 4 CUDA devices (deviceId 0–3) and they all do substantial work.
From the kernel table aggregated by device:
| deviceId | kernel_count | total kernel time (ms) |
|---|---|---|
| 1 | 241,558 | 25,117 |
| 3 | 241,558 | 24,953 |
| 0 | 241,558 | 24,776 |
| 2 | 233,102 | 22,774 |
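So this looks like multi-GPU, tensor-parallel-style execution (consistent with the TP in your filename), where most major kernels appear on all GPUs. Devices 0, 1, and 3 launch identical kernel counts (241,558); device 2 launches about 3.5% fewer (233,102) and has correspondingly lower total kernel time.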
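If you want to reproduce the per-device table yourself, here is a minimal sketch. It assumes the report has been exported to SQLite (for example with `nsys export --type sqlite`) and that the export contains the `CUPTI_ACTIVITY_KIND_KERNEL` table with `start`/`end` timestamps in nanoseconds. The function name `kernel_time_by_device` and the small in-memory table are illustrative stand-ins, not data from your trace.

```python
import sqlite3

# Per-device kernel count and total kernel time.
# Assumes the nsys SQLite schema: CUPTI_ACTIVITY_KIND_KERNEL with
# start/end timestamps in nanoseconds. "end" is quoted because END
# is an SQL keyword.
PER_DEVICE_SQL = """
SELECT deviceId,
       COUNT(*)                 AS kernel_count,
       SUM("end" - start) / 1e6 AS total_ms
FROM CUPTI_ACTIVITY_KIND_KERNEL
GROUP BY deviceId
ORDER BY total_ms DESC
"""


def kernel_time_by_device(conn):
    """Return (deviceId, kernel_count, total_ms) rows, busiest device first."""
    return conn.execute(PER_DEVICE_SQL).fetchall()


# Synthetic stand-in for a real export (hypothetical rows, not from your trace).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL "
    '(start INTEGER, "end" INTEGER, deviceId INTEGER, streamId INTEGER)'
)
conn.executemany(
    "INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (?, ?, ?, ?)",
    [
        (0, 2_000_000, 0, 7),            # 2 ms kernel on device 0
        (3_000_000, 4_000_000, 0, 175),  # 1 ms kernel on device 0
        (0, 5_000_000, 1, 7),            # 5 ms kernel on device 1
    ],
)

rows = kernel_time_by_device(conn)
for device_id, kernel_count, total_ms in rows:
    print(device_id, kernel_count, total_ms)
```

For a real report, point `sqlite3.connect` at the exported file instead of `:memory:`. Adding `streamId` to both the `SELECT` list and the `GROUP BY` clause gives a per-(device, stream) breakdown.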
If your goal is “understand what kernels are being launched on the GPUs”, you can ignore most streams and focus on the dominant ones by total kernel time.
Top (deviceId, streamId) by total kernel time:
| deviceId | streamId | kernel_count | total_ms | avg_us | max_us |
|---|---|---|---|---|---|
| 1 | 7 | 71,080 | 23,853 | 336 | 398,519 |
| 3 | 7 | 71,080 | 23,692 | 333 | 180,099 |
| 0 | 7 | 71,080 | 23,512 | 331 | 50,565 |
| 2 | 7 | 62,627 | 21,511 | 343 | 670,666 |
| 0 | 175 | 157,718 | 1,247 | 7.9 | 130 |
| 2 | 223 | 157,718 | 1,247 | 7.9 | 130 |
| 1 | 223 | 157,718 | 1,246 | 7.9 | 130 |
| 3 | 175 | 157,718 | 1,244 | 7.9 | 130 |
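To see how dominant stream 7 is, you can divide its total by the per-device total from the first table. The sketch below only redoes that arithmetic with the numbers already shown above; the variable names are illustrative.

```python
# Totals in ms, copied from the two tables above.
device_total_ms = {0: 24_776, 1: 25_117, 2: 22_774, 3: 24_953}
stream7_total_ms = {0: 23_512, 1: 23_853, 2: 21_511, 3: 23_692}

# Fraction of each device's kernel time that runs on stream 7.
stream7_share = {
    dev: stream7_total_ms[dev] / device_total_ms[dev]
    for dev in device_total_ms
}

for dev, frac in sorted(stream7_share.items()):
    print(f"device {dev}: stream 7 carries {frac:.1%} of kernel time")
```

On every device the share comes out between 94% and 95%. The second-ranked stream on each device (175 or 223) launches more than twice as many kernels but adds only about 1.2 s, because its kernels average 7.9 µs.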
What this means practically:
When you’re overwhelmed by devices/streams, do this: