GPU architecture

GPU

A GPU, or graphics processing unit, is a type of processor designed to perform large numbers of floating-point operations in parallel. It first gained popularity for video processing and gaming, and more recently for training large language models.

But why are GPUs so fast, and are they always faster than CPUs? The answer lies in the architecture. A GPU is designed to apply a single instruction to multiple data elements at once, abbreviated SIMD (NVIDIA calls its thread-based variant SIMT, for single instruction, multiple threads). A CPU typically contains a control unit (CU), an arithmetic logic unit (ALU), registers, and so on. Modern CPUs have multiple cores, and each core works like a mini CPU with its own CU and ALU. A CPU with 4 cores can execute 4 instruction streams truly in parallel. GPUs are different: they can run thousands of threads in parallel.

A GPU consists of streaming multiprocessors (SMs). Each SM runs its threads in groups of 32 called warps, and each warp executes a single instruction at a time over multiple data elements. This makes matrix multiplication, or any linear algebra operation over large data, very fast. That is essentially what we need when training a large language model. Yes, our AI is just a large number of matrix multiplications.
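To make the warp idea concrete, here is a minimal pure-Python sketch. It is a simulation, not real GPU code: the names `WARP_SIZE`, `saxpy_lane`, `run_warp`, and `saxpy` are hypothetical and chosen only for illustration, and the loop over lanes runs sequentially where a GPU would run all 32 lanes at once.

```python
WARP_SIZE = 32  # threads per warp, as described above


def saxpy_lane(a, x, y):
    """The single 'instruction' every lane executes: a * x + y."""
    return a * x + y


def run_warp(a, xs, ys, warp_id):
    """Simulate one warp: 32 lanes apply the same operation to different data."""
    start = warp_id * WARP_SIZE
    out = []
    for lane in range(WARP_SIZE):
        i = start + lane  # each lane works on its own index
        if i < len(xs):   # lanes past the end of the data stay idle
            out.append(saxpy_lane(a, xs[i], ys[i]))
    return out


def saxpy(a, xs, ys):
    """Cover the whole array with as many warps as needed."""
    num_warps = (len(xs) + WARP_SIZE - 1) // WARP_SIZE
    result = []
    for w in range(num_warps):  # on a GPU these warps would run in parallel
        result.extend(run_warp(a, xs, ys, w))
    return result


xs = [float(i) for i in range(70)]
ys = [1.0] * 70
print(saxpy(2.0, xs, ys)[:4])  # [1.0, 3.0, 5.0, 7.0]
```

With 70 elements the sketch needs 3 warps, and the last warp has lanes with nothing to do. Real kernels guard against this with the same kind of bounds check.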

The bottleneck for a GPU is not its ability to perform parallel operations but moving data. Each GPU has a global memory, and each SM has its own smaller shared memory. Data first needs to move from RAM to global memory, that is, from host to device, before the GPU can operate on it. Each warp then moves data from global memory into its registers, which sit closest to the compute units. So when writing CUDA programs we have to be conservative about memory movement, because that largely determines how fast the program actually runs.
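The cost of memory movement can be illustrated with a rough counting model, again as a pure-Python simulation rather than CUDA. The functions `matmul_naive` and `matmul_tiled` are hypothetical names, and the `reads` counter is an assumption standing in for global memory traffic. Real GPUs also have caches and coalescing rules that this sketch ignores. The tiled version copies a small block into a local list once, playing the role of shared memory, and reuses it.

```python
def matmul_naive(A, B, n):
    """Every multiply-add reads both operands straight from 'global memory'."""
    reads = 0
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                reads += 2  # one read of A, one read of B
    return C, reads


def matmul_tiled(A, B, n, tile):
    """Load a tile of A and B once into 'shared memory', then reuse it."""
    reads = 0
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Copy tiles from global memory; count each element once.
                a_tile = [row[k0:k0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in B[k0:k0 + tile]]
                reads += sum(len(r) for r in a_tile)
                reads += sum(len(r) for r in b_tile)
                # All further accesses hit the local copies.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        for k, a_val in enumerate(a_row):
                            C[i0 + i][j0 + j] += a_val * b_tile[k][j]
    return C, reads


n = 8
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i * j % 3) for j in range(n)] for i in range(n)]
C1, naive_reads = matmul_naive(A, B, n)
C2, tiled_reads = matmul_tiled(A, B, n, 4)
print(C1 == C2, naive_reads, tiled_reads)  # True 1024 256
```

In this model a tile width of 4 cuts the counted global reads by a factor of 4, while the arithmetic stays the same. That is the sense in which memory movement, not compute, sets the speed.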
