Multithreaded Algorithms

Summarized notes on Introduction to Algorithms, Chapter 27

General Notes

  • parallel algorithms run on multiprocessor computers → multiple instructions can execute concurrently
  • parallel computer has many processors → different types:
    • chip multiprocessors - a single chip containing multiple processor cores
    • clusters - built from individual computers connected by a network
    • supercomputers
  • no single model of parallel computing has gained widespread acceptance:
    • shared memory - each processor can directly access any location of memory
    • distributed memory - each processor's memory is private; messages are used to communicate between processors
    • the trend is toward the shared-memory model
  • static threading - threads sharing common memory
    • threads execute independently of one another
    • threads usually exist for duration of the computation
    • programming with static threads directly is difficult and error-prone
  • concurrency platforms provide software layer for coordinating, scheduling, and managing parallel computing resources
    • include dynamic multithreading: enable parallelism in applications without worrying about load balancing and other issues associated with static-thread programming
    • concurrency platform includes a scheduler which load-balances computation automatically
    • two common features: nested parallelism and parallel loops
    • nested parallelism: spawn child thread while parent continues execution
    • parallel loops: iterations execute concurrently

1. Dynamic multithreading

  • serialization of a multithreaded algorithm → delete the keywords spawn, sync, and parallel
  • nested parallelism - when spawn precedes a procedure call → the parent may continue to execute in parallel with the child
  • the keyword spawn only indicates logical parallelism → the scheduler determines the actual execution
  • sync is used to wait for all spawned children to finish their computation
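
For a concrete picture, CLRS's running example is P-FIB, which spawns one recursive call while the parent computes the other, then syncs. A minimal Python sketch, using threading.Thread as a stand-in for spawn and Thread.join for sync (thread-per-spawn is far heavier than a real scheduler, and CPython's GIL prevents true CPU parallelism; only the logical structure matters here):

```python
import threading

def p_fib(n):
    """CLRS P-FIB: spawn one recursive call, compute the other in the
    parent, then sync before combining the results."""
    if n <= 1:
        return n
    result = {}                          # the spawned child writes here

    def child():                         # body of the spawned strand
        result["x"] = p_fib(n - 1)

    t = threading.Thread(target=child)
    t.start()                            # spawn: child may run in parallel
    y = p_fib(n - 2)                     # parent continues concurrently
    t.join()                             # sync: wait for the spawned child
    return result["x"] + y

print(p_fib(10))                         # -> 55
```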

Computation DAG

  • computation DAG - representation of multithreaded computation as a graph

    • vertices are instructions
    • edges are dependencies between instructions, i.e. \((u, v) \in E\) means \(u\) must execute before \(v\)
    • strand is a sequence of instructions with no parallel control keywords
    • if \(G\) has a directed path from strand \(u\) to strand \(v\), they are in series → otherwise \(u\) and \(v\) are (logically) in parallel (see the sketch below)
  • classification of edges in a computation DAG

    • continuation edge (horizontal) connects \(u\) to its successor \(u'\)
    • spawn edge (downward with horizontal continuation edge) when \(u\) spawns a strand \(v\)
    • call edge (downward) normal procedure call
    • return edge (upward) when \(u\) returns to its calling procedure and \(x\) is the next strand in the caller: edge \((u, x)\)
    • initial strand single starting point of computation
    • final strand single ending point of computation

(figure: a computation DAG illustrating continuation, spawn, call, and return edges)
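
The series/parallel definition above is just reachability in the DAG: a minimal sketch, with a made-up toy DAG and strand names, that checks for a directed path in each direction:

```python
# Classify two strands as "in series" or "logically in parallel" by
# checking for a directed path between them in the computation DAG.
def reachable(succ, u, v):
    """True if there is a directed path from strand u to strand v."""
    stack, seen = [u], set()
    while stack:
        w = stack.pop()
        if w == v:
            return True
        if w not in seen:
            seen.add(w)
            stack.extend(succ.get(w, []))
    return False

def relation(succ, u, v):
    if reachable(succ, u, v) or reachable(succ, v, u):
        return "in series"
    return "logically in parallel"

# Toy DAG: s spawns a and continues to b; both join at final strand t.
succ = {"s": ["a", "b"], "a": ["t"], "b": ["t"]}
print(relation(succ, "s", "t"))   # in series (s precedes everything)
print(relation(succ, "a", "b"))   # logically in parallel (no path either way)
```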

Performance measures

Equation                                       Meaning
\(T_P \geq T_1/P\)                             work law
\(T_P \geq T_{\infty}\)                        span law
\(T_1/T_P\)                                    speedup
\(T_1/T_P = \Theta(P)\)                        linear speedup
\(T_1/T_P = P\)                                perfect linear speedup
\(T_1/T_{\infty}\)                             parallelism of an algorithm
\((T_1/T_{\infty}) / P = T_1/(PT_{\infty})\)   slackness
  • work - total time to execute the entire computation on one processor
  • span - the longest time to execute the strands along any path of the DAG → assuming each strand takes unit time, the span equals the number of vertices on the critical path of the DAG
  • speedup - how many times faster the computation is on \(P\) processors than on 1 processor → at most \(P\)
  • linear speedup - speedup grows linearly with the number of processors
  • parallelism can be viewed in 3 ways:
    • the average amount of work along each step of the critical path
    • an upper bound on the maximum possible speedup
    • a limit on the possibility of attaining perfect linear speedup → once the number of processors exceeds the parallelism, the computation cannot achieve perfect linear speedup
  • slackness - the factor by which the parallelism of the computation exceeds the number of processors in the machine

    • if \(< 1\): cannot achieve perfect linear speedup
    • if \(> 1\): the work per processor is the limiting factor
  • concurrency platform's scheduler determines execution in practice → operates on-line without advance knowledge of when threads will be spawned or complete

  • greedy scheduler assigns as many strands to processors as possible in each time step
  • complete step if at least \(P\) strands are ready to execute in a time step
  • incomplete step when fewer than \(P\) strands are ready to execute
  • The running time \(T_P\) of any multithreaded computation scheduled by a greedy scheduler on an ideal parallel computer with \(P\) processors is within a factor of 2 of optimal.
Equation                                       Meaning
\(T_P \leq T_1/P + T_{\infty}\)                running time on an ideal parallel computer w/ greedy scheduler
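
For instance, with \(T_1 = 12\), \(T_{\infty} = 3\), and \(P = 4\), the bound gives \(T_P \leq 12/4 + 3 = 6\). A small sketch that simulates a greedy scheduler on a made-up fork-join DAG to check the bound (complete steps execute exactly \(P\) ready strands, incomplete steps fewer):

```python
# Simulate a greedy scheduler: each time step executes min(P, #ready)
# ready strands -- a complete step runs exactly P, an incomplete step fewer.
def greedy_schedule(edges, vertices, P):
    indeg = {v: 0 for v in vertices}
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = [v for v in vertices if indeg[v] == 0]
    steps = 0
    while ready:
        running, ready = ready[:P], ready[P:]   # greedy: take up to P strands
        steps += 1
        for u in running:                       # retire strands, release successors
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return steps

# Fork-join DAG: strand s fans out to 10 parallel strands, which join at t.
vertices = ["s"] + [f"m{i}" for i in range(10)] + ["t"]
edges = [("s", f"m{i}") for i in range(10)] + [(f"m{i}", "t") for i in range(10)]
T1, Tinf, P = len(vertices), 3, 4               # work 12, span 3 (s -> m_i -> t)
TP = greedy_schedule(edges, vertices, P)
print(TP, TP <= T1 / P + Tinf)                  # -> 5 True  (5 <= 12/4 + 3)
```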

Parallel loops

(figures: pseudocode for parallel loops)
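
CLRS renders parallel loops with the parallel for keywords, implemented by recursively halving the iteration range with spawn; the book's example here is matrix-vector multiplication. A minimal Python sketch of that divide-and-conquer loop (the grain size and names are my own choices):

```python
import threading

def parallel_for(lo, hi, body, grain=2):
    """Run body(i) for lo <= i < hi, iterations logically in parallel.
    Mirrors how CLRS compiles parallel for: split the range, spawn one
    half, recurse on the other in the parent, then sync."""
    if hi - lo <= grain:                 # small range: just run serially
        for i in range(lo, hi):
            body(i)
        return
    mid = (lo + hi) // 2
    t = threading.Thread(target=parallel_for, args=(lo, mid, body, grain))
    t.start()                            # spawn the lower half
    parallel_for(mid, hi, body, grain)   # parent runs the upper half
    t.join()                             # sync before the loop "ends"

# Matrix-vector product y = A x: each row is one independent iteration.
A = [[1, 2], [3, 4], [5, 6]]
x = [10, 1]
y = [0] * len(A)

def body(i):
    y[i] = sum(a * b for a, b in zip(A[i], x))

parallel_for(0, len(A), body)
print(y)                                 # -> [12, 34, 56]
```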

Race Conditions

  • deterministic algorithm - performs the same operations on the same input, independent of scheduling
  • nondeterministic - behavior may vary between executions → typically caused by a race condition
  • determinacy race - when two or more logically parallel instructions access the same memory location and at least one performs a write (see the sketch below)
  • strands are independent when they have no determinacy races between them
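
A classic determinacy race: two logically parallel strands increment a shared variable, but an increment is a read followed by a write, so one update can be lost. A small sketch (the time.sleep is my own device to force the bad interleaving deterministically):

```python
import threading, time

counter = 0

def increment():
    global counter
    tmp = counter        # read the shared location
    time.sleep(0.01)     # widen the race window so the interleaving is visible
    counter = tmp + 1    # write: may clobber the other strand's update

t = threading.Thread(target=increment)
t.start()                # "spawn": the two increments are logically in parallel
increment()              # parent runs the second increment concurrently
t.join()                 # "sync"
print(counter)           # prints 1, not 2: a lost update
```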