
I forgot I bought 3 avocados, they're overripe, I can't eat them, I'm sad, and now I'm learning CUDA


Concept of CUDA

CPU and GPU have separate memories. The typical workflow for accelerated computation on the device is:

  1. Allocate memory on GPU (cudaMalloc)
  2. Copy input data from CPU → GPU (cudaMemcpy HostToDevice)
  3. Launch kernel (GPU does the work)
  4. Copy results from GPU → CPU (cudaMemcpy DeviceToHost)
  5. Free GPU memory (cudaFree)
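A minimal host-side sketch of those five steps (`someKernel` and the `h_a`/`d_a` names are placeholders of mine, and error checking is omitted):

```cuda
int n = 1 << 20;                      // number of elements
size_t bytes = n * sizeof(int);
int *h_a = (int *)malloc(bytes);      // host buffer (placeholder data)
int *d_a = NULL;                      // device pointer

cudaMalloc(&d_a, bytes);                              // 1. allocate on GPU
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // 2. copy host → device
someKernel<<<(n + 255) / 256, 256>>>(d_a, n);         // 3. launch the kernel
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // 4. copy device → host
cudaFree(d_a);                                        // 5. free GPU memory
free(h_a);
```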

The function that runs on the GPU is called a kernel. On our compiler team, we simply call it a device function. Apparently, some founding engineers chose "kernel" because it's the "central" or "core" part of the computation.


We should truly never let engineers make naming decisions, because "kernel" collides with 872316487 other meanings.

We define the kernel function and mark it with the __global__ qualifier, which makes it callable from the host but executed on the device.

__global__ void vecAdd(int *a, int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}


And yes, as you can see, kernels don't have return values; they write their results through output parameters instead.


Now, before we jump into the kernel calling convention, it's better for us to know the limitations of CUDA.

Hard Limits (All CUDA GPUs)

| Resource | Max Value |
| --- | --- |
| Threads per block | 1024 |
| Block dimensions | $x = 1024,\ y = 1024,\ z = 64$ |
| Total block size | $xyz \leq 1024$ |
| Grid dimensions | $x = 2^{31} - 1,\ y = z = 2^{16} - 1$ |
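Because of the 1024 threads-per-block cap, a common pattern is to fix a block size and derive the grid size with a ceiling division. A tiny sketch (`blocks_for` is my own helper name, not a CUDA API):

```c
/* How many blocks of `block` threads are needed to cover `n` elements?
   Ceiling division, because the last block may be only partially full. */
int blocks_for(int n, int block) {
    return (n + block - 1) / block;
}
```

For example, `blocks_for(1000000, 256)` gives 3907 blocks; the `if (i < n)` guard in the kernel then masks off the spare threads in the last block.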


CUDA exposes several built-in variables inside a kernel. Conceptually, they look something like:

struct Dim3 {
    int x, y, z;
};

Dim3 threadIdx; // Thread index within the block
Dim3 blockDim;  // Block size (threads per block)
// it's always true that:
// - (0 <= threadIdx.x < blockDim.x)
// - (0 <= threadIdx.y < blockDim.y)
// - (0 <= threadIdx.z < blockDim.z)
Dim3 blockIdx;  // Block index within the grid
Dim3 gridDim;   // Grid size (blocks per grid)
// it's always true that:
// - (0 <= blockIdx.x < gridDim.x)
// - (0 <= blockIdx.y < gridDim.y)
// - (0 <= blockIdx.z < gridDim.z)
int warpSize;   // 32 on all current NVIDIA GPUs

Now, if you think about it, the maximum number of threads per launch is $1024 \times (2^{31} - 1) \times (2^{16} - 1)^2$. But this is absurd. That's because threads are virtual: they get scheduled onto a limited amount of physical hardware. So let's introduce the concept of the SM (Streaming Multiprocessor).
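For the record, that bound multiplies out as follows (a back-of-the-envelope sketch in plain C; `max_threads_per_launch` is my own name):

```c
/* Upper bound on threads in a single kernel launch:
   1024 threads/block x (2^31 - 1) x (2^16 - 1)^2 blocks. */
double max_threads_per_launch(void) {
    double block   = 1024.0;
    double grid_x  = 2147483647.0;  /* 2^31 - 1 */
    double grid_yz = 65535.0;       /* 2^16 - 1 */
    return block * grid_x * grid_yz * grid_yz;
}
```

That's roughly $9.4 \times 10^{21}$ threads, far beyond anything a GPU can physically run at once.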

🏢 Kitchen = GPU
│
├── 🔥 Stove (SM) — physical hardware, LIMITED (e.g. 68)
│   │
│   ├── 📦 Station (Block) — scheduled onto stoves
│   │   └── 🍳🍳🍳 Chefs (Threads)
│   │
│   ├── 📦 Station (Block) — multiple stations share one stove
│   │   └── 🍳🍳🍳 Chefs (Threads)
│   │
│   └── 🧊 Counter space = Shared Memory (split between stations)
│
├── 🔥 Stove (SM)
│   ├── 📦 Station...
│   └── 📦 Station...
│
└── 🔥 Stove (SM)
    └── ...
// RTX 3080: 68 SMs × 16 max blocks per SM = 1088 simultaneous blocks

kernel<<<1000000, 256>>>();  // 1 million blocks launched
                              // only 1088 run at once
                              // remaining 998,912 wait in queue
| Architecture | CC | Max Threads / SM | Max Blocks / SM | Max Warps / SM | Register File / SM | Max Registers / Thread | Shared Memory / SM | Max Shared Memory / Block |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ada Lovelace (RTX 40xx) | 8.9 | 1,536 | 24 | 48 | 65,536 (64K) 32-bit | 255 | 100 KB | 99 KB |
| Ampere (RTX 30xx consumer / RTX Axx) | 8.6 | 1,536 | 16 | 48 | 65,536 | 255 | 100 KB | 99 KB |
| Ampere (A100 / A40 etc.) | 8.0 | 2,048 | 32 | 64 | 65,536 | 255 | 164 KB | 163 KB |
| Hopper (H100 / H200) | 9.0 | 2,048 | 32 | 64 | 65,536 | 255 | 228 KB | 227 KB |
| Blackwell (datacenter B100/B200/GB200) | 10.0 | 2,048 | 32 | 64 | 65,536 | 255 | 228 KB | 227 KB |
| Blackwell (consumer RTX 50xx series) | 12.0 | 1,536 | 32 | 48 | 65,536 | 255 | 128 KB | 99 KB |

CUDA answers "how many blocks fit on ONE SM?" by taking the minimum of these limits:

| Limiting factor | How it's calculated (Blackwell example) | Typical limit |
| --- | --- | --- |
| Hardware max blocks | 24 (consumer) or 32 (datacenter) | 24 or 32 |
| Shared memory | floor( sharedMemPerSM / (sharedPerBlock + 1 KB overhead) ) | e.g. 228 KB / 10 KB = 22 blocks |
| Registers | floor( 65,536 registers / registers per block ) | varies |
| Threads | floor( 1,536 or 2,048 threads / threads per block ) | varies |

The shared memory you requested is often the factor that decides the final number.
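The min-of-four-limits calculation above can be sketched as plain host code. The function name and the numbers are illustrative; in practice you'd ask the real API, `cudaOccupancyMaxActiveBlocksPerMultiprocessor`:

```c
int min2(int a, int b) { return a < b ? a : b; }

/* Blocks that fit on one SM: the minimum of the four limits above.
   Shared-memory sizes are in KB; the +1 KB models per-block overhead. */
int blocks_per_sm(int hw_max_blocks,
                  int smem_per_sm_kb, int smem_per_block_kb,
                  int regs_per_sm, int regs_per_block,
                  int threads_per_sm, int threads_per_block) {
    int by_smem    = smem_per_sm_kb / (smem_per_block_kb + 1);
    int by_regs    = regs_per_sm / regs_per_block;
    int by_threads = threads_per_sm / threads_per_block;
    return min2(min2(hw_max_blocks, by_smem),
                min2(by_regs, by_threads));
}
```

With datacenter-Blackwell-style inputs (32 max blocks, 228 KB shared per SM, 9 KB requested per block, 8,192 registers per block, 256 threads per block), shared memory would allow 22 blocks, but registers and threads each allow only 8, so 8 blocks run per SM.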


| Keyword | Meaning |
| --- | --- |
| __global__ | Callable from host, runs on device (the kernel) |
| __device__ | Callable from device only, runs on device (helper function) |
| __host__ | Callable from host, runs on host (normal CPU function) |

| Keyword | Scope | Lifetime | Speed |
| --- | --- | --- | --- |
| __shared__ | Block | Block | Fast |
| __constant__ | All threads (read-only) | Application | Fast (cached) |
| __device__ | All threads (global) | Application | Slow |
| (no qualifier) | Thread (register/local) | Thread | Fastest |

| Keyword/Function | What It Does |
| --- | --- |
| __syncthreads() | Barrier: all threads in the block must reach this point |
| __syncwarp(mask) | Barrier for threads within a warp |
| __threadfence() | Ensures memory writes are visible to all threads on the device |
| __threadfence_block() | Same, but within the block only |
| __threadfence_system() | Writes visible to the host and all devices |
Here's everything together in one annotated example:
__constant__ int TABLE[64];                  // global scope

__device__ int helper(int x) { return x*2; } // device-only function

__global__ void kernel(int *d_data, int n) { // THE kernel
    __shared__ float cache[256];              // block-local fast mem

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-ins
    
    cache[threadIdx.x] = d_data[i];
    __syncthreads();                          // sync

    atomicAdd(&d_data[0], 1);                // atomic

    unsigned mask = __ballot_sync(0xFFFFFFFF, i < n); // warp vote
}
