GPU Parallelization
Cail Daley
THW, Nov 20 2019
Moore’s Law
Figure: transistor count over time, to 2018 (Max Roser, https://ourworldindata.org/uploads/2019/05/Transistor-Count-over-time-to-2018.png, CC BY-SA 4.0)
Dennard scaling
drain voltage, capacitance, inductance \(∝\) transistor size
clock frequency \(∝ 1 /\) transistor size
so total power stays the same!
(figure: Soham Chatterjee)
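A back-of-the-envelope sketch of why, assuming the standard dynamic-power relation \(P ∝ C V^2 f\) and a feature-size scaling factor \(s < 1\):

\[
P_{\text{transistor}} \propto C V^2 f \;\to\; (sC)(sV)^2 \frac{f}{s} = s^2\, C V^2 f,
\qquad
\frac{\text{transistors}}{\text{area}} \propto \frac{1}{s^2},
\]

so the power drawn per unit chip area, and hence per chip, stays constant.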
The end of scaling
C. Moore, “Data Processing in Exascale-class Computing Systems,” presented at the Salishan Conference on High-Speed Computing, 2011.
CPUs: Latency Oriented
latency is the lag between when an instruction is issued and when it completes
CPUs use all kinds of complicated tricks to minimize latency
GPUs: Throughput Oriented
throughput is the number of operations completed per unit time
GPUs maximize throughput at the cost of latency
Throughput \(×\) Latency = Queue Size
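With illustrative numbers (this relation is Little's law): if a device completes 1,000 operations per cycle and each operation takes 100 cycles from start to finish, roughly 100,000 operations must be in flight at once to keep it busy.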
tasks can be sensitive to latency…
serial tasks: sequential or iterative calculations
…or to throughput:
pleasingly/embarrassingly parallel tasks: calculations are independent of one another
GPU anatomy
three levels of organization (see the kernel sketch below):
1. GPUs contain many small “threads” capable of performing calculations; each thread has a little bit of memory and a threadIdx (1, 2, or 3D)
2. threads are grouped into “blocks”; each block has some shared memory and a blockIdx (1, 2, or 3D)
3. blocks live on a grid
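As a minimal sketch of all three levels, consider a vector add in which each thread handles one element; the kernel name addKernel and the sizes are illustrative, while threadIdx, blockIdx, blockDim, and the <<<blocks, threads>>> launch syntax are CUDA's own:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element; its global index combines its block's
// position on the grid (blockIdx) with its position in the block (threadIdx).
__global__ void addKernel(int n, const float *a, const float *b, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard: the grid may be slightly larger than n
        out[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(float));  // unified memory keeps the host code short
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;                                 // threads grouped into a block...
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ...blocks laid out on a grid
    addKernel<<<blocks, threadsPerBlock>>>(n, a, b, out);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```

Here the grid is one-dimensional; image work like the blur on the next slide typically uses 2D blocks and grids instead.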
Example: Image Blurring
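A common form of this example is a simple box blur over a grayscale image, where each thread averages the pixels in a small square around its own output pixel; blurKernel, BLUR_SIZE, and the 2D launch geometry are illustrative assumptions, not necessarily the version shown in the talk:

```cuda
#define BLUR_SIZE 1  // radius of the box filter (1 => a 3x3 window)

__global__ void blurKernel(const unsigned char *in, unsigned char *out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column for this thread
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row for this thread
    if (col < w && row < h) {
        int sum = 0, count = 0;
        // average the neighborhood, skipping pixels that fall off the image
        for (int dr = -BLUR_SIZE; dr <= BLUR_SIZE; ++dr)
            for (int dc = -BLUR_SIZE; dc <= BLUR_SIZE; ++dc) {
                int r = row + dr, c = col + dc;
                if (r >= 0 && r < h && c >= 0 && c < w) {
                    sum += in[r * w + c];
                    ++count;
                }
            }
        out[row * w + col] = (unsigned char)(sum / count);
    }
}

// launched with something like:
//   dim3 block(16, 16);
//   dim3 grid((w + 15) / 16, (h + 15) / 16);
//   blurKernel<<<grid, block>>>(d_in, d_out, w, h);
```

Every output pixel is independent of every other, so this is exactly the pleasingly parallel case from the throughput slide.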
Shared memory matrix multiply
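A standard way to do this tiles the computation: each block cooperatively stages TILE × TILE sub-blocks of A and B in fast shared memory, so each value is read from global memory once per tile rather than once per multiplication. TILE and matMulShared below are illustrative (not necessarily the talk's version); __shared__ and __syncthreads() are CUDA's own primitives:

```cuda
#define TILE 16  // tile edge; one block computes a TILE x TILE patch of C

__global__ void matMulShared(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];  // per-block staging buffers
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // walk the tiles along A's row and B's column
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        // each thread loads one element of each tile (zero-padding past the edge)
        As[threadIdx.y][threadIdx.x] = (row < n && t * TILE + threadIdx.x < n)
            ? A[row * n + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t * TILE + threadIdx.y < n && col < n)
            ? B[(t * TILE + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();  // wait until the whole tile is loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before anyone overwrites the tile
    }

    if (row < n && col < n)
        C[row * n + col] = acc;
}
```

Launched with dim3 block(TILE, TILE) and a grid of ⌈n/TILE⌉ × ⌈n/TILE⌉ blocks; the __syncthreads() barriers are what make reusing the shared buffers safe.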
Thanks!