On a GPU, where data lives matters as much as what you compute. Memory decisions usually separate a kernel that hits peak throughput from one that stalls.
Know the hierarchy
Global memory is large but relatively slow. Shared memory is small, on-chip, and fast, shared by threads in a block. Constant memory is cached and ideal for read-only values used by every thread. Registers are fastest of all but scarce. Good kernels move hot data up this hierarchy.
Coalesce global access
When the threads of a warp read consecutive addresses, the hardware serves them in one transaction. Scattered or strided access forces many transactions and wastes bandwidth. Laying out data so neighboring threads touch neighboring memory is often the single largest win.
Stage reused data in shared memory
If a block reuses the same values, load them once into shared memory and read from there. Tiled matrix multiplication is the classic example, turning many slow global reads into a few, then fast on-chip reuse.
Speed up transfers with pinned memory
Host-to-device copies from pageable memory are slow. Allocating pinned host memory lets the copy engine run at full bandwidth and enables overlap with compute through streams.
Pool allocations in production
Calling cudaMalloc on the hot path fragments memory and adds latency. Production systems preallocate a pool of buffers and reuse them, keeping the critical path allocation-free.
Key takeaways
- Match data to the right memory space in the hierarchy
- Coalesce global memory access for full bandwidth
- Stage reused data in shared memory
- Use pinned host memory for fast, overlappable transfers
- Preallocate and pool buffers in production