GPU Parallelization

Cail Daley

THW, Nov 20 2019

Moore’s Law

Transistor counts over time. By Max Roser, https://ourworldindata.org/uploads/2019/05/Transistor-Count-over-time-to-2018.png, CC BY-SA 4.0

Dennard scaling



  • drain voltage, capacitance, and inductance \(∝\) transistor size
  • clock frequency \(∝ 1 /\) transistor size
  • so total power per unit area stays the same! (sketched below)
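
A quick sanity check of that claim, writing \(s\) for the linear transistor dimension and using the standard dynamic-power relation (this algebra is my own gloss on the slide, with \(C ∝ s\), \(V ∝ s\), and \(f ∝ 1/s\)):

\[ P = C V^2 f \;∝\; s \cdot s^2 \cdot \frac{1}{s} = s^2 \]

Transistor area also scales as \(s^2\), so power per unit area is independent of \(s\): shrinking transistors used to be free.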

The end of scaling

C. Moore, “Data Processing in Exascale-class Computing Systems,” presented at the Salishan Conference on High-Speed Computing, 2011.

CPUs: Latency Oriented

  • latency is the lag between issuing an instruction and its completion

  • CPUs use all kinds of complicated tricks (large caches, branch prediction, out-of-order and speculative execution) to minimize latency

GPUs: Throughput Oriented

  • throughput is the number of operations completed per unit time

  • GPUs maximize throughput at the cost of latency

Throughput \(×\) Latency = Queue Size (Little’s Law)
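
A worked example with made-up but plausible numbers: if each operation takes 400 cycles to finish (latency) and the hardware is to sustain 32 operations per cycle (throughput), then \(32 × 400 = 12{,}800\) operations must be in flight at once. A GPU supplies that queue with thousands of concurrent threads; a CPU attacks the latency term instead.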

tasks can be sensitive to latency…

  • serial tasks
    • sequential or iterative calculations

or throughput

  • pleasingly/embarrassingly parallel tasks
    • calculations are independent of one another (see the contrast sketched below)
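
A toy contrast in host-side C++ (the language CUDA kernels live in; the function names are my own choices): the prefix sum is latency-bound because step \(i\) needs the result of step \(i-1\), while the elementwise add is embarrassingly parallel because every element is independent.

    #include <cstddef>
    #include <vector>

    // latency-sensitive: a recurrence; iteration i cannot start
    // until iteration i-1 has finished
    std::vector<float> prefix_sum(const std::vector<float>& x) {
        std::vector<float> y(x.size());
        float acc = 0.0f;
        for (std::size_t i = 0; i < x.size(); ++i) {
            acc += x[i];
            y[i] = acc;
        }
        return y;
    }

    // throughput-friendly: every iteration is independent, so the
    // loop could be split across thousands of GPU threads
    std::vector<float> elementwise_add(const std::vector<float>& a,
                                       const std::vector<float>& b) {
        std::vector<float> c(a.size());
        for (std::size_t i = 0; i < a.size(); ++i)
            c[i] = a[i] + b[i];
        return c;
    }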

GPU anatomy



three levels of organization:

  • GPUs contain many small “threads” capable of performing calculations
    • each thread has a little bit of private memory and a threadIdx (1, 2, or 3D)
  • threads are grouped into “blocks”
    • each block has some shared memory and a blockIdx (1, 2, or 3D)
  • blocks live on a grid that spans the whole computation (index sketch below)
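
A minimal CUDA sketch of how these indices combine, assuming a 1D launch; the kernel name scale and the sizes here are my own choices:

    #include <cstdio>

    // each thread computes a global index from its blockIdx and
    // threadIdx, then handles exactly one array element
    __global__ void scale(float *x, int n, float a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              // the last block may be partially full
            x[i] *= a;
    }

    int main()
    {
        const int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float)); // unified memory, for brevity
        for (int i = 0; i < n; ++i) x[i] = 1.0f;

        int threads = 256;
        int blocks = (n + threads - 1) / threads; // round up to cover all of x
        scale<<<blocks, threads>>>(x, n, 2.0f);
        cudaDeviceSynchronize();

        std::printf("x[0] = %f\n", x[0]);         // expect 2.0
        cudaFree(x);
    }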

Example: Image Blurring
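
A sketch of the kind of kernel behind this example, assuming a single-channel, row-major image; the 3×3 window (radius R) and the names in and out are my own choices:

    // each thread averages the pixels in a small window around its
    // own pixel; every output pixel is independent of the others
    __global__ void blur(const unsigned char *in, unsigned char *out,
                         int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col >= width || row >= height) return;

        const int R = 1;        // 3x3 box blur; an assumed radius
        int sum = 0, count = 0;
        for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx) {
                int r = row + dy, c = col + dx;
                if (r >= 0 && r < height && c >= 0 && c < width) {
                    sum += in[r * width + c];
                    ++count;    // window is clipped at the image edges
                }
            }
        out[row * width + col] = (unsigned char)(sum / count);
    }

Launched with a 2D grid of 2D blocks (e.g. 16 × 16 threads per block), every pixel gets its own thread: a pleasingly parallel, throughput-bound task.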







Shared memory matrix multiply
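
A sketch of the classic tiled multiply, assuming square \(n × n\) row-major matrices with \(n\) divisible by the tile width (TILE and the kernel name are my choices):

    #define TILE 16

    // launched with dim3 block(TILE, TILE), grid(n / TILE, n / TILE):
    // each block computes one TILE x TILE tile of C
    __global__ void matmul(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE]; // tiles staged in fast,
        __shared__ float Bs[TILE][TILE]; // block-shared memory

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        // march across A's row of tiles and down B's column of tiles
        for (int t = 0; t < n / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();             // wait until the tile is fully loaded

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();             // wait before overwriting the tile
        }
        C[row * n + col] = acc;
    }

Each element of A and B is fetched from global memory once per tile rather than once per multiply-add, so every load is reused TILE times; that reuse is the whole point of staging through shared memory.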



Thanks!