
Systems & Models

  • distributed system - collection of individual computing devices that can communicate with each other → each processor has semi-independent agenda
  • Difficult to construct due to: 1. asynchrony 2. limited local knowledge 3. failures

Message Passing Systems

  • processors communicate by sending messages over a bidirectional communication channel
  • this pattern of connections is called topology or network → represented as an undirected graph where
    • processors represent vertices and
    • connections are edges (when connection exists)
  • Algorithm for such system is a local program for each processor
  • configuration vector represents the state of each processor in the system, such that \(C = (q_0, q_1, \dots, q_{n-1})\) contains state \(q_i\) of processor \(p_i\)
  • events (or actions) are operations that happen in the system, e.g. local computation or message delivery
  • execution models the behavior of the system over time as a sequence of configurations and events → can be finite or infinite → must satisfy conditions used to represent correctness properties of the model (see the sketch after this list)
  • Categories of conditions:
    • safety - must hold in every finite prefix of any execution
    • liveness - must eventually hold a certain number of times (possibly infinitely often) → liveness includes a fairness condition: an infinite execution contains infinitely many actions by a processor.
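
A minimal sketch (in Python, with hypothetical names) of how a configuration vector and an execution, as an alternating sequence of configurations and events, could be represented; the model itself prescribes no concrete data structures.

```python
from dataclasses import dataclass

@dataclass
class Configuration:
    """Vector C = (q_0, ..., q_{n-1}); states[i] is the state q_i of processor p_i."""
    states: list

@dataclass
class Event:
    """A local computation step or a message delivery at a single processor."""
    kind: str              # "compute" or "deliver"
    processor: int         # index of the processor the event occurs at
    message: object = None

# An execution alternates configurations and events (finite or infinite):
# C_0, e_1, C_1, e_2, C_2, ...
execution = [
    Configuration(states=["init", "init"]),
    Event(kind="compute", processor=0),
    Configuration(states=["sent", "init"]),
    Event(kind="deliver", processor=1, message="hello"),
    Configuration(states=["sent", "received"]),
]
```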

Asynchronous Systems

  • when there exists no upper bound on how long it takes for a message to be delivered or how much time elapses between processor steps, e.g. the Internet
  • admissible execution - each processor has an infinite number of computation events and every message is eventually delivered → assumes processors do not fail → there may be many executions of the same algorithm
  • terminated state
    • each processor's set of states includes a subset of terminated states
    • once a processor enters such a state it remains in it
    • algorithm terminates once all processors are in a terminated state
  • message complexity - maximum, over all admissible executions, of the total number of messages sent
  • timed execution - associates with each event a time (a real number in \(\mathbb{R}\)) at which the event occurs
  • measuring time complexity:
    • assume that max. message delay is 1 unit of time
    • complexity is the maximum, over all timed admissible executions in which each message has delay at most 1, of the time until termination (see the example below)
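
A toy illustration of this timing convention, with made-up event times: delays are scaled so that no message is in transit for more than 1 unit, and the time complexity of the execution is the time of its last event.

```python
# Timed execution: (time, event) pairs; message delays already normalized so the
# maximum delay is 1 unit of time.  All names and values here are illustrative.
timed_execution = [
    (0.0, "p0 computes and sends m1 to p1"),
    (0.7, "p1 receives m1 (delay 0.7 <= 1)"),
    (0.7, "p1 computes and sends m2 to p2"),
    (1.7, "p2 receives m2 (delay 1.0 <= 1)"),
    (1.7, "p2 enters a terminated state"),
]

# Time complexity of this execution = time of its last event; the algorithm's
# time complexity is the maximum of this quantity over all admissible executions.
print(max(t for t, _ in timed_execution))   # 1.7
```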

Synchronous System

  • processors can send messages to neighbors and execution progresses in rounds
  • admissible execution - an execution is admissible if it is infinite → then every processor takes an infinite number of computation steps and every message is eventually delivered (no failures)
  • Unlike in an asynchronous system, once the algorithm is fixed only the initial configuration can vary between executions
  • The same definition of terminated states applies as in asynchronous systems
  • Message complexity is the maximum total number of messages sent over all admissible executions
  • To measure time complexity, count the number of rounds in any admissible execution until termination (see the sketch below)
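
A small round-by-round simulation that counts rounds (time complexity) and messages (message complexity). The algorithm is a plain flooding broadcast, chosen only to illustrate the bookkeeping; it is not taken from these notes.

```python
# Synchronous flooding broadcast on a 4-node cycle (illustrative example).
# In each round, every processor that received the message in the previous
# round forwards it to all of its neighbors.
topology = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}   # undirected graph
informed = {0}                  # p0 starts with the message
frontier = {0}                  # processors that got the message last round
rounds, messages = 0, 0

while frontier:
    rounds += 1
    next_frontier = set()
    for p in frontier:
        for q in topology[p]:
            messages += 1       # every send counts toward message complexity
            if q not in informed:
                next_frontier.add(q)
    informed |= next_frontier
    frontier = next_frontier

print(rounds, messages)         # 3 rounds, 8 messages (= one per edge direction)
```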

Parallel Models

Multiple cooperating processors increase efficiency by:
  • increasing system throughput by performing tasks in parallel → systems for this are available as hardware/OS software
  • reducing response time by partitioning tasks between processors → realizing this is a challenge

→ speeding up computation requires that faster-than-sequential algorithms exist

PRAM

  • PRAM = Parallel Random Access Machine → set of synchronous processors with concurrent access to shared memory
  • Common features:
    • there are \(P\) processors each with unique PID
    • each processor has access to its PID and number of processors
    • shared memory is accessible to all processors and each access takes 1 unit of time
    • shared memory has \(Q\) cells and input of size \(N \leq Q\) is stored in first \(N\) cells
    • processors have access to input size \(N\)
    • memory cells \(> N\) are cleared and contain zeros
    • each processor has local, private memory
    • all memory cells can store up to \(\Theta(\lg \max\{N,P\})\) bits
    • usually \(Q\) is polynomial in \(P\) and the input size \(N\), and private memory is small
  • Processor instructions are defined in terms of 3 synchronous cycles (see the sketch after this list):
    • read cycle - read from shared to private memory
    • compute cycle - performs computation using data in local memory
    • write cycle - write from local memory to shared memory
  • assume fault-free execution environment
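
A minimal sketch of one synchronous PRAM step under the read/compute/write discipline. The cell layout and the doubling computation are made up for illustration, and the sequential Python loop stands in for processors acting in lockstep.

```python
# One PRAM step: every processor performs a read cycle, a compute cycle, and a
# write cycle.  Accesses touch disjoint cells, so the pattern is EREW-safe.
N, P, Q = 8, 4, 16
shared = list(range(1, N + 1)) + [0] * (Q - N)   # input in first N cells, rest zeroed

def step(pid, local):
    local["x"] = shared[pid]           # read cycle: shared -> private memory
    local["x"] = local["x"] * 2        # compute cycle: private memory only
    shared[N + pid] = local["x"]       # write cycle: private -> shared memory

private = [{} for _ in range(P)]       # each processor's private memory
for pid in range(P):                   # all P processors run the same local program
    step(pid, private[pid])

print(shared)                          # results land in cells N .. N+P-1
```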

Variants

  • EREW - exclusive read exclusive write - no concurrent access to same memory location
  • CREW - concurrent read, exclusive write - arbitrary concurrent reads from same memory location are allowed
  • CRCW - concurrent read, concurrent write - there are 3 write policies (resolution of concurrent writes is sketched in code after this list):
    1. Common CRCW - all concurrent writes must write the same value
    2. Arbitrary CRCW - the value to be written is chosen arbitrarily
    3. Priority CRCW - processor with the highest PID succeeds
  • EREW is weakest and Priority CRCW is strongest → an algorithm that works on a weaker variant works on a stronger variant → stronger variants can be simulated on weaker variants at modest cost
  • IDEAL PRAM - sometimes used to show lower bounds without failures:
    • has no bounds on memory size and processor can perform arbitrary number of computations in unit time
    • also classified as EREW, CREW, CRCW IDEAL PRAM
  • Memory snapshot model
    • alternative model for showing lower bounds for fault-prone models
    • processors can read and process entire shared memory at unit cost
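
A sketch of how the three CRCW write policies could resolve concurrent writes to a single cell, following the Priority rule as stated above (highest PID wins); the function and its names are hypothetical.

```python
import random

def resolve(writes, policy):
    """writes: list of (pid, value) pairs aimed at the same shared cell."""
    if policy == "common":
        values = {v for _, v in writes}
        assert len(values) == 1, "Common CRCW requires all writers to agree"
        return values.pop()
    if policy == "arbitrary":
        return random.choice(writes)[1]   # any single writer may win
    if policy == "priority":
        return max(writes)[1]             # highest PID wins (per these notes)
    raise ValueError(policy)

writes = [(0, 7), (3, 9), (1, 5)]
print(resolve(writes, "priority"))    # 9
print(resolve(writes, "arbitrary"))   # one of 7, 9, 5
```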

Advantages & Limitations

  • in parallel or distributed computation there is no universally accepted high-level programming paradigm
  • at abstract level PRAM model comes close to satisfying the 3 conditions that made RAM successful:
    1. high level language for expressing complex algorithms
    2. machine architecture (von Neumann) that can be implemented efficiently
    3. efficient compilation to translate programs into machine code
  • implementing PRAM is not straightforward
    • some efficient simulations of PRAM exist, e.g. on a hypercube
    • sometimes mapped to specific target architecture
  • improvements in PRAM models often result in improvements in practical architectures

Measuring Parallelism & Efficiency

Note: more notes on this same topic here

  • \(T_1\) - time on single sequential processors
  • \(P\) - number of processors
  • \(T_p\) - time on \(P\) processors
  • \(T^*_1(N)\) - running time of the most efficient sequential algorithm
Formula                                            Description
\(W = P \cdot T_p(N)\)                             work law
\(W = 1 \cdot T_1(N) = T_1(N)\)                    work law when \(P=1\)
\(W = P \cdot T_p(N) = \Theta(T^*_1(N))\)          optimal parallel algo
\(W = \Theta(T^*_1(N) \cdot \log^{O(1)}N)\)        efficient parallel algo
\(S = T^*_1(N) / T_p(N)\)                          speedup
\(S = \Theta(P)\)                                  linear speedup (optimal)
\(S = \Theta(P/\log^{c}N)\)                        speedup of efficient algo
\(S = \Theta(P/N^{\varepsilon})\)                  polynomially efficient speedup
  • Always measure speedup \(S\) relative to the most efficient sequential algorithm
  • a parallel algorithm is optimal when the parallel work on an input of size \(N\) is within a multiplicative constant of \(T^*_1(N)\) → achieved only with speedup linear in \(P\)
  • Linear speedup requires a setting where processors operate in synchrony
  • Efficient parallel algorithm has speedup that is near linear in \(P\)
  • For polynomially efficient algorithm: \(0 < \varepsilon < 1\)
  • Using work as the main complexity measure is stricter than parallel time alone; an algorithm is work-optimal if
    \(W(N, P) = P \cdot T_p(N) = O(W_{LB}(N, P))\), where \(W_{LB}(N, P)\) is the lower bound on the work required
  • Some tasks that are already doable as efficient or optimal parallel algorithms:
    • adding \(N\) integers (see the worked sketch at the end of this section)
    • sorting \(N\) records
    • maintaining lists or trees of size \(N\)
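
A worked sketch of the measures above for adding \(N\) integers: each processor first sums a block of roughly \(N/P\) inputs, then the \(P\) partial sums are combined by a binary tree. It assumes \(T^*_1(N) = N - 1\) additions; the function name and the chosen values of \(P\) are illustrative only.

```python
import math

def parallel_sum_cost(N, P):
    # Phase 1: each processor sequentially sums a block of ~N/P inputs.
    local_time = math.ceil(N / P) - 1          # additions per processor
    # Phase 2: combine the P partial sums by a binary tree, one level per round.
    tree_time = math.ceil(math.log2(P))        # parallel rounds
    T_p = local_time + tree_time               # parallel time T_p(N)
    work = P * T_p                             # work law: W = P * T_p(N)
    speedup = (N - 1) / T_p                    # S = T*_1(N) / T_p(N)
    return T_p, work, speedup

N = 1 << 20
for P in (16, 1024, N // 20):                  # last case: P ~ N / log N
    T_p, work, speedup = parallel_sum_cost(N, P)
    print(f"P={P:>6}  T_p={T_p:>6}  W={work:>8}  S={speedup:8.1f}")
# With P = Theta(N / log N) the work stays Theta(N) and the speedup is Theta(P),
# so the algorithm is efficient/work-optimal in the sense defined above.
```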