Systems & Models
- distributed system - collection of individual computing devices that can communicate with each other → each processor has semi-independent agenda
- Difficult to construct due to: 1. asynchrony 2. limited local knowledge 3. failures
Message Passing Systems
- processors communicate by sending messages over a bidirectional communication channel
- this pattern of connections is called the topology or network → represented as an undirected graph, where
- processors represent vertices and
- connections are edges (when connection exists)
- An algorithm for such a system consists of a local program for each processor
- configuration vector represents the state of each processor in the system, such that \(C = (q_0, q_1, \ldots, q_{n-1})\) contains state \(q_i\) of processor \(p_i\)
- events (or actions) are operations that happen in the system, e.g. local computation or message delivery
- execution models the behavior of the system over time as a sequence of configurations and events → can be finite or infinite → must satisfy conditions that represent correctness properties of the model (see the sketch after this list)
- Categories of conditions:
- safety - must hold in every finite prefix of any execution
- liveness - must eventually hold a certain number of times (possibly infinitely often) → liveness includes the fairness condition: an infinite execution contains infinitely many actions by each processor.
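A minimal Python sketch of these definitions, assuming a toy two-processor token-passing algorithm (the `Config` and `deliver` names are illustrative, not part of the model):

```python
# Illustrative sketch: an execution is an alternating sequence of configurations
# and events; here two processors pass a token, and every delivery event triggers
# a local computation step at the receiver.
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    states: tuple           # (q_0, q_1, ..., q_{n-1}) -- one state per processor
    in_transit: frozenset   # messages sent but not yet delivered: (src, dst, payload)

def deliver(cfg: Config, src: int, dst: int, payload: str) -> Config:
    """Delivery event at p_dst: consume the message, do a local computation
    (bump a counter), and send the token back to the sender."""
    states = list(cfg.states)
    states[dst] = ("holding", cfg.states[dst][1] + 1)
    in_transit = (cfg.in_transit - {(src, dst, payload)}) | {(dst, src, "token")}
    return Config(tuple(states), frozenset(in_transit))

# Execution prefix C_0, e_1, C_1, e_2, C_2, ...: deliver whatever is in transit.
c = Config(states=(("holding", 1), ("idle", 0)), in_transit=frozenset({(0, 1, "token")}))
for _ in range(3):
    src, dst, payload = next(iter(c.in_transit))
    c = deliver(c, src, dst, payload)
    print(c.states)
```

Safety conditions would be checked on every finite prefix of such a sequence; the fairness condition says an infinite run of this loop must never starve a processor.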
Asynchronous Systems
- when there exists no upper bound on how long it takes for a message to be delivered or how much time elapses between processor steps, e.g. the Internet
- admissible execution - each processor has an infinite number of computation events and every message is eventually delivered → assumes processors do not fail → there may be many admissible executions of the same algorithm
- terminated state
- each processor's state includes a set of such states
- once processor enters such state it remains in it
- algorithm terminates once all processors are in a terminated state
- message complexity - maximum of total number of messages sent over all admissible executions
- timed execution - associates a real-valued time (\(\mathbb{R}\)) with each event, indicating when the event occurs
- measuring time complexity:
- assume that max. message delay is 1 unit of time
- complexity is the maximum time until termination of all timed, admissible executions with each message having delay at most 1
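A hedged sketch of how this can be measured: simulate one timed, admissible execution of a flooding broadcast in which every message delay is at most 1 (the ring topology, delay distribution, and the `flood_time` name are assumptions of the example, not from the notes):

```python
# Illustrative sketch: time complexity of asynchronous flooding when every
# message delay is at most 1 unit. Returns the time at which the last processor
# learns the message, and the total number of messages sent.
import heapq, random

def flood_time(adj, root=0, delay=lambda: random.uniform(0.1, 1.0)):
    informed = {root}
    events, messages = [], 0
    for v in adj[root]:
        heapq.heappush(events, (delay(), root, v)); messages += 1
    finish = 0.0
    while events:                               # admissible: every message is delivered
        t, _, v = heapq.heappop(events)
        if v in informed:
            continue
        informed.add(v)
        finish = max(finish, t)
        for w in adj[v]:                        # newly informed processor floods on
            heapq.heappush(events, (t + delay(), v, w)); messages += 1
    return finish, messages

# Ring of 6 processors: time is O(diameter), messages are O(|E|).
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(flood_time(ring))
```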
Synchronous Systems
- processors can send messages to neighbors and execution progresses in rounds
- admissible execution - if it is infinite → every processor takes an infinite number of computation steps and every message is eventually delivered (no failures)
- Unlike in async systems, once the algorithm is fixed the execution is determined entirely by the initial configuration
- The same definition of terminated states applies as in asynchronous systems
- Message complexity is the maximum number of total messages sent over all admissible executions
- To measure time complexity count the number of rounds in any admissible execution until termination
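A short sketch of the round structure, assuming a simple broadcast over a path graph (the graph and names are illustrative): in every round each informed processor sends to its neighbors, all messages arrive, then everyone computes, and the round count is the time complexity.

```python
# Illustrative sketch: synchronous broadcast, counting rounds until every
# processor is informed (i.e. has reached a terminated state).
def synchronous_broadcast(adj, root=0):
    informed = {root}
    rounds = 0
    while len(informed) < len(adj):                         # not all terminated yet
        outbox = {v for u in informed for v in adj[u]}      # send phase
        informed |= outbox                                  # delivery + compute phase
        rounds += 1
    return rounds                                           # = eccentricity of the root

path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 5] for i in range(5)}
print(synchronous_broadcast(path))                          # 4 rounds on a path of 5 nodes
```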
Parallel Models
Multiple cooperating processors increase efficiency by:
- increasing system throughput by performing tasks in parallel → systems for this are available as hardware/OS software
- reducing response time by partitioning tasks between processors → realizing this is a challenge
→ speeding up a computation requires that a faster-than-sequential parallel algorithm exists
PRAM
- PRAM = Parallel Random Access Machine → set of synchronous processors with concurrent access to shared memory
- Common features:
- there are \(P\) processors each with unique PID
- each processor has access to its PID and number of processors
- shared memory is accessible to all processors and each access takes 1 unit of time
- shared memory has \(Q\) cells and input of size \(N \leq Q\) is stored in first \(N\) cells
- processors have access to input size \(N\)
- memory cells \(> N\) are cleared and contain zeros
- each processor has local, private memory
- all memory cells can store up to \(\Theta(\lg \max\{N,P\})\) bits
- usually \(Q\) is polynomial in \(P\) and \(N\), and private memory is small
- Processor instructions are defined in terms of 3 synchronous cycles:
- read cycle - read from shared to private memory
- compute cycle - performs computation using data in local memory
- write cycle - write from local memory to shared memory
- assume fault-free execution environment
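A sketch of how these cycles structure a PRAM step, assuming the standard balanced-tree summation of \(N\) integers in \(O(\log N)\) steps; the Python loop over `pid` stands in for the \(P\) synchronous processors and is not a real parallel implementation:

```python
# Illustrative EREW-PRAM-style sketch: each synchronous step is split into the
# read / compute / write cycles; no two processors touch the same cell in a cycle.
import math

def pram_sum(shared):
    n = len(shared)
    steps = math.ceil(math.log2(n)) if n > 1 else 0
    for step in range(steps):
        stride = 1 << step
        pending = {}
        for pid in range(n):                         # one iteration per "processor"
            if pid % (2 * stride) == 0 and pid + stride < n:
                a = shared[pid]                      # read cycle
                b = shared[pid + stride]             # read cycle
                pending[pid] = a + b                 # compute cycle (private memory)
        for addr, val in pending.items():            # write cycle, applied in lockstep
            shared[addr] = val
    return shared[0]

print(pram_sum([3, 1, 4, 1, 5, 9, 2, 6]))            # 31
```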
Variants
- EREW - exclusive read exclusive write - no concurrent access to same memory location
- CREW - concurrent read, exclusive write - arbitrary concurrent reads from same memory location are allowed
- CRCW - concurrent read, concurrent write - there are 3 write policies (see the sketch after this list):
- Common CRCW - all concurrent writes must write the same value
- Arbitrary CRCW - the value to be written is chosen arbitrarily
- Priority CRCW - processor with the highest PID succeeds
- EREW is the weakest and Priority CRCW is the strongest → an algorithm that works on a weaker variant also works on a stronger variant → stronger variants can be simulated on weaker variants at modest cost
- IDEAL PRAM - sometimes used to show lower bounds without failures:
- has no bounds on memory size and processor can perform arbitrary number of computations in unit time
- also classified as EREW, CREW, CRCW IDEAL PRAM
- Memory snapshot model
- alternative model for showing lower bounds for fault-prone models
- processors can read and process entire shared memory at unit cost
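A small sketch of how the three CRCW write policies resolve concurrent writes to one cell (the function names are illustrative; Priority follows the highest-PID rule stated above):

```python
# Illustrative sketch: resolve the (pid, value) pairs that attempt to write the
# same shared-memory cell in a single write cycle under each CRCW policy.
def common_crcw(writes):
    values = {v for _, v in writes}
    assert len(values) == 1, "Common CRCW: all concurrent writers must agree"
    return values.pop()

def arbitrary_crcw(writes):
    return writes[0][1]                 # any single writer may win

def priority_crcw(writes):
    return max(writes)[1]               # writer with the highest PID succeeds

writes = [(2, 7), (5, 9), (3, 7)]       # concurrent writes to one cell
print(arbitrary_crcw(writes), priority_crcw(writes))   # 7 9
print(common_crcw([(0, 4), (1, 4)]))                   # 4 (all writers agree)
```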
Advantages & Limitations
- in parallel or distributed computation there is no universally accepted high-level programming paradigm
- at abstract level PRAM model comes close to satisfying the 3 conditions that made RAM successful:
- high level language for expressing complex algorithms
- machine architecture (von Neumann) that can be implemented efficiently
- efficient compilation to translate programs into machine code
- implementing PRAM is not straightforward
- some efficient simulations of PRAM exist, e.g. on hypercube networks
- sometimes mapped to specific target architecture
- improvements in PRAM models often result in improvements in practical architectures
Measuring Parallelism & Efficiency
Note: more notes on this same topic here
- \(T_1\) - time on a single sequential processor
- \(P\) - number of processors
- \(T_p\) - time on \(P\) processors
- \(T^*_1(N)\) - time of the most efficient sequential algorithm
Formula | Description |
---|---|
\(W = P \cdot T_p(N)\) | work law |
\(W = 1 \cdot T_1(N) = T_1(N)\) | work law when \(P=1\) |
\(W = P \cdot T_p(N) = \Theta(T^*_1(N))\) | optimal parallel algo |
\(W = \Theta(T^*_1(N) \cdot \log^{O(1)}N)\) | efficient parallel algo |
\(S = T^*_1(N) / T_p(N)\) | speedup |
\(S = \Theta(P)\) | linear speedup (optimal) |
\(S = \Theta(P/\log^{c}N)\) | speedup of efficient algo |
\(S = \Theta(P/N^{\varepsilon})\) | polynomially efficient speedup |
- Always measure speedup \(S\) relative to the most efficient sequential algorithm
- a parallel algorithm is optimal when the parallel work on an input of size \(N\) is within a multiplicative constant of \(T^*_1(N)\) → this is achieved only with linear speedup in \(P\)
- For linear speedup need setting where processors operate in synchrony
- Efficient parallel algorithm has speedup that is near linear in \(P\)
- For polynomially efficient algorithm: \(0 < \varepsilon < 1\)
- Using work as the main complexity measure is stricter than using parallel time alone; an algorithm is work-optimal if \(W(N, P) = P \cdot T_p(N) = O(W_{LB}(N, P))\), where \(W_{LB}(N, P)\) is the corresponding lower bound on work
- Some tasks for which efficient or optimal parallel algorithms already exist (see the worked example after this list):
- adding \(N\) integers
- sorting \(N\) records
- maintaining lists or trees of size \(N\)
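- Worked example (the summation algorithm and its bound are assumptions for illustration, not derived in these notes): summing \(N\) integers with the standard parallel algorithm takes \(T_p(N) = O(N/P + \log P)\), while \(T^*_1(N) = \Theta(N)\)
 - work: \(W = P \cdot T_p(N) = O(N + P \log P) = \Theta(T^*_1(N))\) when \(P \leq N/\log N\) → work-optimal in that range
 - speedup: \(S = T^*_1(N) / T_p(N) = \Theta(P)\) in the same range → linear speedup
 - for larger \(P\) the \(\log P\) term dominates and the speedup is no longer linear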