Multithreaded Algorithms
Summarized notes on Introduction to Algorithms, Chapter 27
General Notes
- parallel algorithms run on multiprocessor computers → multiple instructions can execute concurrently
- parallel computer has many processors → different types:
- chip multiprocessors - single multicore chip with multiple cores
- clusters - built from individual computers connected by an interconnection network
- supercomputers
- no parallel model has widespread acceptance
- shared memory - each processor can directly access any location of memory
- distributed memory - each processor's memory is private; message passing is used to communicate between processors
- trend toward shared memory model
- static threading - threads sharing common memory
- threads execute independently of other threads
- threads usually exist for duration of the computation
- programming directly with static threads is difficult and error-prone
- concurrency platforms provide software layer for coordinating, scheduling, and managing parallel computing resources
- include dynamic multithreading: enable parallelism in applications without worrying about load balancing and other issues associated with static-thread programming
- concurrency platform includes a scheduler which load-balances computation automatically
- two common features: nested parallelism and parallel loops
- nested parallelism: spawn child thread while parent continues execution
- parallel loops: iterations execute concurrently
1. Dynamic multithreading
- serialization of a multithreaded algorithm → delete keywords spawn, sync and parallel
- nested parallelism - when spawn precedes a procedure call → parent may continue to execute in parallel with the child
- keyword spawn only indicates logical parallelism → scheduler will determine actual execution
- sync is used to wait for all spawned children to finish computation
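A minimal Go sketch (not the book's pseudocode) of the spawn/sync pattern above, using the usual Fibonacci example: launching a goroutine plays the role of spawn, and receiving its result from a channel plays the role of sync. The name pFib and the channel-based return are illustrative assumptions.

```go
package main

import "fmt"

// pFib mirrors the P-FIB pattern: "spawn" the first recursive call,
// keep computing in the parent strand, then "sync" before combining.
func pFib(n int) int {
	if n <= 1 {
		return n
	}
	ch := make(chan int, 1)
	go func() { ch <- pFib(n - 1) }() // spawn: child may run in parallel with the parent
	y := pFib(n - 2)                  // parent continues executing concurrently
	x := <-ch                         // sync: wait for the spawned child to finish
	return x + y
}

func main() {
	fmt.Println(pFib(10)) // prints 55
}
```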
Computation DAG
- computation DAG - representation of a multithreaded computation as a graph
- vertices are instructions
- edges are dependencies between instructions, i.e. \((u, v) \in E\) means \(u\) must execute before \(v\)
- strand is a sequence of instructions with no parallel control keywords
- if \(G\) has a directed path from strand \(u\) to strand \(v\), they are in series → otherwise \(u\) and \(v\) are (logically) in parallel
- classification of edges in a computation DAG
- continuation edge (horizontal) connects \(u\) to its successor \(u'\)
- spawn edge (downward with horizontal continuation edge) when \(u\) spawns a strand \(v\)
- call edge (downward) normal procedure call
- return edge (upward) when \(u\) returns to its calling procedure and \(x\) is the next strand to execute in the caller: edge \((u, x)\)
- initial strand single starting point of computation
- final strand single ending point of computation
Performance measures
\(T_P\) denotes the running time of the computation on \(P\) processors; the work \(T_1\) and span \(T_{\infty}\) are defined below.

| Equation | Meaning |
|---|---|
| \(T_P \geq T_1/P\) | work law |
| \(T_P \geq T_{\infty}\) | span law |
| \(T_1/T_P\) | speedup |
| \(T_1/T_P = \Theta(P)\) | linear speedup |
| \(T_1/T_P = P\) | perfect linear speedup |
| \(T_1/T_{\infty}\) | parallelism of an algorithm |
| \((T_1/T_{\infty})/P = T_1/(P T_{\infty})\) | slackness |
- work - total time to execute the entire computation on one processor
- span - the longest time to execute the strands along any path of the DAG → assuming unit-time strands, the span is the number of vertices on the critical path of the DAG
- speedup - how many times faster the computation is on \(P\) processors than on 1 processor → at most \(P\)
- linear speedup - when speedup increases linearly wrt. the number of processors
- parallelism can be viewed in 3 ways
- average amount of work along each step of critical path
- upper bound on maximum possible speedup
- limit on the possibility of attaining perfect linear speedup → once the number of processors exceeds the parallelism, perfect linear speedup cannot be achieved
- slackness - factor by which the parallelism of the computation exceeds the number of processors in the machine
- if \(< 1\) cannot achieve perfect linear speedup
- if \(> 1\) work per processor is limiting factor
- concurrency platform's scheduler determines execution in practice → operates on-line without advance knowledge of when threads will be spawned or complete
- greedy scheduler assigns as many strands to processors as possible in each time step
- complete step if at least \(P\) strands are ready to execute in a time step
- incomplete step when fewer than \(P\) strands are ready to execute
- the running time \(T_P\) of any multithreaded computation scheduled by a greedy scheduler on an ideal parallel computer with \(P\) processors is within a factor of 2 of optimal
| Running time | Setting |
|---|---|
| \(T_P \leq T_1/P + T_{\infty}\) | greedy scheduler on an ideal parallel computer with \(P\) processors |
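- worked example (same hypothetical numbers): with \(T_1 = 2048\), \(T_{\infty} = 8\), \(P = 32\), the greedy bound gives \(T_P \leq 2048/32 + 8 = 72\); since the work and span laws force \(T_P \geq \max(T_1/P, T_{\infty}) = 64\), any greedy schedule here is within a factor of \(72/64 \approx 1.13\) of optimal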
Parallel loops
- the parallel keyword before a for loop indicates that loop iterations may execute concurrently; a compiler can implement a parallel for loop as a divide-and-conquer subroutine using nested parallelism, recursively spawning the two halves of the iteration range (see the sketch below)
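A rough Go sketch of a parallel loop, assuming a helper parallelFor and a matrix-vector product as the loop body (both are illustrative, not from the text). Each iteration gets its own goroutine and a WaitGroup supplies the implicit sync at the end of the loop; a real runtime would instead spawn halves of the iteration range recursively and rely on a work-stealing scheduler.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelFor runs body(i) for i in [0, n), one goroutine per iteration,
// then waits for all of them to finish -- a simple stand-in for "parallel for".
func parallelFor(n int, body func(i int)) {
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func(i int) {
			defer wg.Done()
			body(i)
		}(i)
	}
	wg.Wait()
}

func main() {
	// y = A * x computed row by row; the row iterations are independent.
	A := [][]float64{{1, 2}, {3, 4}}
	x := []float64{1, 1}
	y := make([]float64, len(A))
	parallelFor(len(A), func(i int) {
		for j, a := range A[i] {
			y[i] += a * x[j]
		}
	})
	fmt.Println(y) // [3 7]
}
```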
Race Conditions
- deterministic algorithm performs same operations on same input independent of scheduling
- nondeterministic when behavior may vary between executions → typically caused by a race condition
- determinacy race - when two or more logically parallel instructions access the same memory location and at least one performs a write
- strands are independent when they have no determinacy races
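A small Go illustration (an assumption of this summary, not an example from the text) of a determinacy race and one way to remove it: two goroutines update a shared variable without synchronization, so the result depends on scheduling; giving each strand its own memory location makes the strands independent. Running with Go's -race flag detects the first version.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	x := 0
	var wg sync.WaitGroup

	// Determinacy race: both goroutines read and write x in parallel, so the
	// final value depends on how the read-modify-write sequences interleave
	// (it can come out as 1 or 2). `go run -race` flags this access pattern.
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			x = x + 1 // unsynchronized read and write of shared x
		}()
	}
	wg.Wait()
	fmt.Println(x)

	// Fix: make the strands independent by giving each one its own location
	// and combining the results only after the sync.
	parts := make([]int, 2)
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			parts[i] = 1 // each goroutine writes a distinct element: no race
		}(i)
	}
	wg.Wait()
	fmt.Println(parts[0] + parts[1]) // always 2
}
```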