Parallelize GPU benchmarks across instances, never within
Speeding up a wall-clock-bound GPU benchmark sweep when the timing is hardware-exclusive.
For benchmarks that measure achieved kernel throughput (CUDA Events + L2 flush + median over trials), running multiple agents on one GPU corrupts measurements — VRAM contention, nvcc collisions, and stray kernel launches mid-trial silently invalidate the peak-fraction number. The right axis to parallelize on is GPU instances, not threads on one card: shard the sweep so each cell (model, problem) lives on its own bare-metal GPU. GPU-hours stay constant — you trade dollars for wall-clock at parity, not for free. Bonus: shard by the slowest-changing axis (model, since each has its own auth/billing) so per-machine secret-shipping happens once per shard.
Before scripting a parallel sweep, grep the existing run scripts for the comment that says why they are sequential. The maintainer almost always has a reason; if it is timing integrity, sharding-across-instances is the only safe parallelism.