GPU Parallelization
Cail Daley
THW, Nov 20 2019
Moore’s Law
Figure: transistor count over time, to 2018 (Max Roser, https://ourworldindata.org/uploads/2019/05/Transistor-Count-over-time-to-2018.png, CC BY-SA 4.0)
Dennard scaling
drain voltage, capacitance, inductance \(∝\) transistor size
clock frequency \(∝ 1 /\) transistor size
so total power stays the same!
(figure: Soham Chatterjee)
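A back-of-the-envelope sketch of why, assuming the standard dynamic-power relation \(P ∝ C V^2 f\) and a feature-size scaling factor \(s < 1\):

\[
P_{\text{transistor}} \propto C V^2 f \;\to\; (sC)(sV)^2 \frac{f}{s} = s^2\, C V^2 f,
\qquad
\frac{\text{transistors}}{\text{area}} \propto \frac{1}{s^2},
\]

so the power drawn per unit chip area, and hence per chip, stays constant.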
The end of scaling
C. Moore, “Data Processing in Exascale-class Computing Systems,” presented at the Salishan Conference on High-Speed Computing, 2011.
CPUs: Latency Oriented
latency is the lag between when an instruction is issued and when it completes
CPUs use all kinds of complicated tricks to minimize latency
GPUs: Throughput Oriented
throughput is the number of operations completed per unit time
GPUs maximize throughput at the cost of latency
Throughput \(×\) Latency = Queue Size
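With illustrative numbers (this relation is Little's law): if a device completes 1,000 operations per cycle and each operation takes 100 cycles from start to finish, roughly 100,000 operations must be in flight at once to keep it busy.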
tasks can be sensitive to latency…
serial tasks: sequential or iterative calculations
…or to throughput:
pleasingly/embarrassingly parallel tasks: calculations are independent of one another
GPU anatomy
three levels of organization (see the kernel sketch below):
1. GPUs contain many small “threads” capable of performing calculations; each thread has a little bit of memory and a threadIdx (1, 2, or 3D)
2. threads are grouped into “blocks”; each block has some shared memory and a blockIdx (1, 2, or 3D)
3. blocks live on a grid
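As a minimal sketch of all three levels, consider a vector add in which each thread handles one element; the kernel name addKernel and the sizes are illustrative, while threadIdx, blockIdx, blockDim, and the <<<blocks, threads>>> launch syntax are CUDA's own:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element; its global index combines its block's
// position on the grid (blockIdx) with its position in the block (threadIdx).
__global__ void addKernel(int n, const float *a, const float *b, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard: the grid may be slightly larger than n
        out[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(float));  // unified memory keeps the host code short
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;                                 // threads grouped into a block...
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ...blocks laid out on a grid
    addKernel<<<blocks, threadsPerBlock>>>(n, a, b, out);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```

Here the grid is one-dimensional; image work like the blur on the next slide typically uses 2D blocks and grids instead.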
Example: Image Blurring
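A common form of this example is a simple box blur over a grayscale image, where each thread averages the pixels in a small square around its own output pixel; blurKernel, BLUR_SIZE, and the 2D launch geometry are illustrative assumptions, not necessarily the version shown in the talk:

```cuda
#define BLUR_SIZE 1  // radius of the box filter (1 => a 3x3 window)

__global__ void blurKernel(const unsigned char *in, unsigned char *out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column for this thread
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row for this thread
    if (col < w && row < h) {
        int sum = 0, count = 0;
        // average the neighborhood, skipping pixels that fall off the image
        for (int dr = -BLUR_SIZE; dr <= BLUR_SIZE; ++dr)
            for (int dc = -BLUR_SIZE; dc <= BLUR_SIZE; ++dc) {
                int r = row + dr, c = col + dc;
                if (r >= 0 && r < h && c >= 0 && c < w) {
                    sum += in[r * w + c];
                    ++count;
                }
            }
        out[row * w + col] = (unsigned char)(sum / count);
    }
}

// launched with something like:
//   dim3 block(16, 16);
//   dim3 grid((w + 15) / 16, (h + 15) / 16);
//   blurKernel<<<grid, block>>>(d_in, d_out, w, h);
```

Every output pixel is independent of every other, so this is exactly the pleasingly parallel case from the throughput slide.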
Shared memory matrix multiply
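A standard way to do this tiles the computation: each block cooperatively stages TILE × TILE sub-blocks of A and B in fast shared memory, so each value is read from global memory once per tile rather than once per multiplication. TILE and matMulShared below are illustrative (not necessarily the talk's version); __shared__ and __syncthreads() are CUDA's own primitives:

```cuda
#define TILE 16  // tile edge; one block computes a TILE x TILE patch of C

__global__ void matMulShared(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];  // per-block staging buffers
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // walk the tiles along A's row and B's column
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        // each thread loads one element of each tile (zero-padding past the edge)
        As[threadIdx.y][threadIdx.x] = (row < n && t * TILE + threadIdx.x < n)
            ? A[row * n + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t * TILE + threadIdx.y < n && col < n)
            ? B[(t * TILE + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();  // wait until the whole tile is loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before anyone overwrites the tile
    }

    if (row < n && col < n)
        C[row * n + col] = acc;
}
```

Launched with dim3 block(TILE, TILE) and a grid of ⌈n/TILE⌉ × ⌈n/TILE⌉ blocks; the __syncthreads() barriers are what make reusing the shared buffers safe.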
Thanks!