Multithreaded Algorithms
Summarized notes on Introduction to Algorithms, Chapter 27
General Notes
- parallel algorithms run on multiprocessor computers → multiple instructions can execute concurrently
- parallel computer has many processors → different types:
- chip multiprocessors - single multicore chip with multiple cores
- clusters - built from individual computers connected by an interconnection network
- supercomputers
- no parallel model has widespread acceptance
- shared memory - each processor can directly access any location of memory
- distributed memory - each processor's memory is private; message passing is used to communicate between processors
- trend toward shared memory model
- static threading - threads sharing common memory
- threads execute independently of other threads
- threads usually exist for duration of the computation
- programming directly with static threads is difficult and error-prone
- concurrency platforms provide software layer for coordinating, scheduling, and managing parallel computing resources
- include dynamic multithreading: enable parallelism in applications without worrying about load balancing and other issues associated with static-thread programming
- concurrency platform includes a scheduler which load-balances computation automatically
- two common features: nested parallelism and parallel loops
- nested parallelism: spawn child thread while parent continues execution
- parallel loops: iterations execute concurrently
1. Dynamic multithreading
- serialization of a multithreaded algorithm → delete keywords spawn, sync and parallel
- nested parallelism - when spawn precedes a procedure call → parent may continue to execute in parallel with the child
- keyword spawn only indicates logical parallelism → scheduler will determine actual execution
- sync is used to wait for all spawned children to finish computation
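A minimal Go sketch (not the book's pseudocode) of the spawn/sync pattern above, using the usual Fibonacci example: launching a goroutine plays the role of spawn, and receiving its result from a channel plays the role of sync. The name pFib and the channel-based return are illustrative assumptions.

```go
package main

import "fmt"

// pFib mirrors the P-FIB pattern: "spawn" the first recursive call,
// keep computing in the parent strand, then "sync" before combining.
func pFib(n int) int {
	if n <= 1 {
		return n
	}
	ch := make(chan int, 1)
	go func() { ch <- pFib(n - 1) }() // spawn: child may run in parallel with the parent
	y := pFib(n - 2)                  // parent continues executing concurrently
	x := <-ch                         // sync: wait for the spawned child to finish
	return x + y
}

func main() {
	fmt.Println(pFib(10)) // prints 55
}
```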
Computation DAG
- computation DAG - representation of a multithreaded computation as a graph
- vertices are instructions
- edges are dependencies between instructions, i.e. \((u, v) \in E\) means \(u\) must execute before \(v\)
- strand is a sequence of instructions with no parallel control keywords
- if \(G\) has a directed path from strand \(u\) to strand \(v\), they are in series → otherwise \(u\) and \(v\) are (logically) in parallel
- classification of edges in a computation DAG
- continuation edge (horizontal) connects \(u\) to its successor \(u'\)
- spawn edge (downward with horizontal continuation edge) when \(u\) spawns a strand \(v\)
- call edge (downward) normal procedure call
- return edge (upward) when \(u\) returns to its calling procedure and \(x\) is the next strand to execute in the caller: edge \((u, x)\)
- initial strand single starting point of computation
- final strand single ending point of computation
Performance measures
\(T_P\) denotes the running time of the computation on \(P\) processors; the work \(T_1\) and span \(T_{\infty}\) are defined below.

| Equation | Meaning |
|---|---|
| \(T_P \geq T_1/P\) | work law |
| \(T_P \geq T_{\infty}\) | span law |
| \(T_1/T_P\) | speedup |
| \(T_1/T_P = \Theta(P)\) | linear speedup |
| \(T_1/T_P = P\) | perfect linear speedup |
| \(T_1/T_{\infty}\) | parallelism of an algorithm |
| \((T_1/T_{\infty})/P = T_1/(P T_{\infty})\) | slackness |
- work - total time to execute the entire computation on one processor
- span - the longest time to execute the strands along any path of the DAG → assuming unit-time strands, the span is the number of vertices on the critical path of the DAG
- speedup - how many times faster the computation is on \(P\) processors than on 1 processor → at most \(P\)
- linear speedup - when speedup increases linearly wrt. the number of processors
- parallelism can be viewed in 3 ways
- average amount of work along each step of critical path
- upper bound on maximum possible speedup
- limit on the possibility of attaining perfect linear speedup → once the number of processors exceeds the parallelism, perfect linear speedup cannot be achieved
- slackness - factor by which the parallelism of the computation exceeds the number of processors in the machine
- if \(< 1\) cannot achieve perfect linear speedup
- if \(> 1\) work per processor is limiting factor
- concurrency platform's scheduler determines execution in practice → operates on-line without advance knowledge of when threads will be spawned or complete
- greedy scheduler assigns as many strands to processors as possible in each time step
- complete step if at least \(P\) strands are ready to execute in a time step
- incomplete step when fewer than \(P\) strands are ready to execute
- the running time \(T_P\) of any multithreaded computation scheduled by a greedy scheduler on an ideal parallel computer with \(P\) processors is within a factor of 2 of optimal
| Running time | Setting |
|---|---|
| \(T_P \leq T_1/P + T_{\infty}\) | greedy scheduler on an ideal parallel computer with \(P\) processors |
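- worked example (same hypothetical numbers): with \(T_1 = 2048\), \(T_{\infty} = 8\), \(P = 32\), the greedy bound gives \(T_P \leq 2048/32 + 8 = 72\); since the work and span laws force \(T_P \geq \max(T_1/P, T_{\infty}) = 64\), any greedy schedule here is within a factor of \(72/64 \approx 1.13\) of optimal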
Parallel loops
- the parallel keyword before a for loop indicates that loop iterations may execute concurrently; a compiler can implement a parallel for loop as a divide-and-conquer subroutine using nested parallelism, recursively spawning the two halves of the iteration range (see the sketch below)
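A rough Go sketch of a parallel loop, assuming a helper parallelFor and a matrix-vector product as the loop body (both are illustrative, not from the text). Each iteration gets its own goroutine and a WaitGroup supplies the implicit sync at the end of the loop; a real runtime would instead spawn halves of the iteration range recursively and rely on a work-stealing scheduler.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelFor runs body(i) for i in [0, n), one goroutine per iteration,
// then waits for all of them to finish -- a simple stand-in for "parallel for".
func parallelFor(n int, body func(i int)) {
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func(i int) {
			defer wg.Done()
			body(i)
		}(i)
	}
	wg.Wait()
}

func main() {
	// y = A * x computed row by row; the row iterations are independent.
	A := [][]float64{{1, 2}, {3, 4}}
	x := []float64{1, 1}
	y := make([]float64, len(A))
	parallelFor(len(A), func(i int) {
		for j, a := range A[i] {
			y[i] += a * x[j]
		}
	})
	fmt.Println(y) // [3 7]
}
```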
Race Conditions
- deterministic algorithm performs same operations on same input independent of scheduling
- nondeterministic when behavior may vary between executions → typically caused by a race condition
- determinacy race - when two or more logically parallel instructions access the same memory location and at least one performs a write
- strands are independent when they have no determinacy races
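A small Go illustration (an assumption of this summary, not an example from the text) of a determinacy race and one way to remove it: two goroutines update a shared variable without synchronization, so the result depends on scheduling; giving each strand its own memory location makes the strands independent. Running with Go's -race flag detects the first version.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	x := 0
	var wg sync.WaitGroup

	// Determinacy race: both goroutines read and write x in parallel, so the
	// final value depends on how the read-modify-write sequences interleave
	// (it can come out as 1 or 2). `go run -race` flags this access pattern.
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			x = x + 1 // unsynchronized read and write of shared x
		}()
	}
	wg.Wait()
	fmt.Println(x)

	// Fix: make the strands independent by giving each one its own location
	// and combining the results only after the sync.
	parts := make([]int, 2)
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			parts[i] = 1 // each goroutine writes a distinct element: no race
		}(i)
	}
	wg.Wait()
	fmt.Println(parts[0] + parts[1]) // always 2
}
```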