
Systems & Models

  • distributed system - collection of individual computing devices that can communicate with each other → each processor has semi-independent agenda
  • Difficult to construct due to: 1. asynchrony 2. limited local knowledge 3. failures

Message Passing Systems

  • processors communicate by sending messages over a bidirectional communication channel
  • this pattern of connections is called topology or network → represented as an undirected graph where
    • processors represent vertices and
    • connections are edges (when connection exists)
  • Algorithm for such system is a local program for each processor
  • configuration vector represents the state of each processor in the system, such that \(C = (q_0, q_1, \dots, q_{n-1})\) contains state \(q_i\) of processor \(p_i\)
  • events (or actions) are operations that happen in the system, e.g. local computation or message delivery
  • execution models the behavior of the system over time as a sequence of configurations and events → can be finite or infinite → must satisfy conditions used to represent correctness properties of the model (see the sketch after this list)
  • Categories of conditions:
    • safety - must hold in every finite prefix of any execution
    • liveness - must eventually hold a certain number of times (possibly infinitely often) → liveness includes a fairness condition: an infinite execution contains infinitely many actions by a processor.
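
A minimal sketch (in Python, with hypothetical names) of how a configuration vector and an execution, as an alternating sequence of configurations and events, could be represented; the model itself prescribes no concrete data structures.

```python
from dataclasses import dataclass

@dataclass
class Configuration:
    """Vector C = (q_0, ..., q_{n-1}); states[i] is the state q_i of processor p_i."""
    states: list

@dataclass
class Event:
    """A local computation step or a message delivery at a single processor."""
    kind: str              # "compute" or "deliver"
    processor: int         # index of the processor the event occurs at
    message: object = None

# An execution alternates configurations and events (finite or infinite):
# C_0, e_1, C_1, e_2, C_2, ...
execution = [
    Configuration(states=["init", "init"]),
    Event(kind="compute", processor=0),
    Configuration(states=["sent", "init"]),
    Event(kind="deliver", processor=1, message="hello"),
    Configuration(states=["sent", "received"]),
]
```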

Asynchronous Systems

  • when there exists no upper bound on how long it takes for a message to be delivered or how much time elapses between processor steps, e.g. the Internet
  • admissible execution - each processor has an infinite number of computation events and every message is eventually delivered → assumes processors do not fail → there may be many executions of the same algorithm
  • terminated state
    • each processor's set of states includes a subset of terminated states
    • once a processor enters such a state it remains in it
    • algorithm terminates once all processors are in a terminated state
  • message complexity - maximum, over all admissible executions, of the total number of messages sent
  • timed execution - associates with each event a time (a real number in \(\mathbb{R}\)) at which the event occurs
  • measuring time complexity:
    • assume that max. message delay is 1 unit of time
    • complexity is the maximum, over all timed admissible executions in which each message has delay at most 1, of the time until termination (see the example below)
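
A toy illustration of this timing convention, with made-up event times: delays are scaled so that no message is in transit for more than 1 unit, and the time complexity of the execution is the time of its last event.

```python
# Timed execution: (time, event) pairs; message delays already normalized so the
# maximum delay is 1 unit of time.  All names and values here are illustrative.
timed_execution = [
    (0.0, "p0 computes and sends m1 to p1"),
    (0.7, "p1 receives m1 (delay 0.7 <= 1)"),
    (0.7, "p1 computes and sends m2 to p2"),
    (1.7, "p2 receives m2 (delay 1.0 <= 1)"),
    (1.7, "p2 enters a terminated state"),
]

# Time complexity of this execution = time of its last event; the algorithm's
# time complexity is the maximum of this quantity over all admissible executions.
print(max(t for t, _ in timed_execution))   # 1.7
```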

Synchronous System

  • processors can send messages to neighbors and execution progresses in rounds
  • admissible execution - an execution is admissible if it is infinite → then every processor takes an infinite number of computation steps and every message is eventually delivered (no failures)
  • Unlike in an asynchronous system, once the algorithm is fixed only the initial configuration can vary between executions
  • The same definition of terminated states applies as in asynchronous systems
  • Message complexity is the maximum total number of messages sent over all admissible executions
  • To measure time complexity, count the number of rounds in any admissible execution until termination (see the sketch below)
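
A small round-by-round simulation that counts rounds (time complexity) and messages (message complexity). The algorithm is a plain flooding broadcast, chosen only to illustrate the bookkeeping; it is not taken from these notes.

```python
# Synchronous flooding broadcast on a 4-node cycle (illustrative example).
# In each round, every processor that received the message in the previous
# round forwards it to all of its neighbors.
topology = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}   # undirected graph
informed = {0}                  # p0 starts with the message
frontier = {0}                  # processors that got the message last round
rounds, messages = 0, 0

while frontier:
    rounds += 1
    next_frontier = set()
    for p in frontier:
        for q in topology[p]:
            messages += 1       # every send counts toward message complexity
            if q not in informed:
                next_frontier.add(q)
    informed |= next_frontier
    frontier = next_frontier

print(rounds, messages)         # 3 rounds, 8 messages (= one per edge direction)
```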

Parallel Models

Multiple cooperating processors increase efficiency by:
  • increasing system throughput by performing tasks in parallel → systems for this are available as hardware/OS software
  • reducing response time by partitioning tasks between processors → realizing this is a challenge

→ speeding up computation requires that faster-than-sequential algorithms exist

PRAM

  • PRAM = Parallel Random Access Machine → set of synchronous processors with concurrent access to shared memory
  • Common features:
    • there are \(P\) processors each with unique PID
    • each processor has access to its PID and number of processors
    • shared memory is accessible to all processors and each access takes 1 unit of time
    • shared memory has \(Q\) cells and input of size \(N \leq Q\) is stored in first \(N\) cells
    • processors have access to input size \(N\)
    • memory cells \(> N\) are cleared and contain zeros
    • each processor has local, private memory
    • all memory cells can store up to \(\Theta(\lg \max\{N,P\})\) bits
    • usually \(Q\) is polynomial in \(P\) and the input size \(N\), and private memory is small
  • Processor instructions are defined in terms of 3 synchronous cycles (see the sketch after this list):
    • read cycle - read from shared to private memory
    • compute cycle - performs computation using data in local memory
    • write cycle - write from local memory to shared memory
  • assume fault-free execution environment
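
A minimal sketch of one synchronous PRAM step under the read/compute/write discipline. The cell layout and the doubling computation are made up for illustration, and the sequential Python loop stands in for processors acting in lockstep.

```python
# One PRAM step: every processor performs a read cycle, a compute cycle, and a
# write cycle.  Accesses touch disjoint cells, so the pattern is EREW-safe.
N, P, Q = 8, 4, 16
shared = list(range(1, N + 1)) + [0] * (Q - N)   # input in first N cells, rest zeroed

def step(pid, local):
    local["x"] = shared[pid]           # read cycle: shared -> private memory
    local["x"] = local["x"] * 2        # compute cycle: private memory only
    shared[N + pid] = local["x"]       # write cycle: private -> shared memory

private = [{} for _ in range(P)]       # each processor's private memory
for pid in range(P):                   # all P processors run the same local program
    step(pid, private[pid])

print(shared)                          # results land in cells N .. N+P-1
```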

Variants

  • EREW - exclusive read exclusive write - no concurrent access to same memory location
  • CREW - concurrent read, exclusive write - arbitrary concurrent reads from same memory location are allowed
  • CRCW - concurrent read, concurrent write - there are 3 write policies (resolution of concurrent writes is sketched in code after this list):
    1. Common CRCW - all concurrent writes must write the same value
    2. Arbitrary CRCW - the value to be written is chosen arbitrarily
    3. Priority CRCW - processor with the highest PID succeeds
  • EREW is weakest and Priority CRCW is strongest → an algorithm that works on a weaker variant works on a stronger variant → stronger variants can be simulated on weaker variants at modest cost
  • IDEAL PRAM - sometimes used to show lower bounds without failures:
    • has no bounds on memory size and processor can perform arbitrary number of computations in unit time
    • also classified as EREW, CREW, CRCW IDEAL PRAM
  • Memory snapshot model
    • alternative model for showing lower bounds for fault-prone models
    • processors can read and process entire shared memory at unit cost
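
A sketch of how the three CRCW write policies could resolve concurrent writes to a single cell, following the Priority rule as stated above (highest PID wins); the function and its names are hypothetical.

```python
import random

def resolve(writes, policy):
    """writes: list of (pid, value) pairs aimed at the same shared cell."""
    if policy == "common":
        values = {v for _, v in writes}
        assert len(values) == 1, "Common CRCW requires all writers to agree"
        return values.pop()
    if policy == "arbitrary":
        return random.choice(writes)[1]   # any single writer may win
    if policy == "priority":
        return max(writes)[1]             # highest PID wins (per these notes)
    raise ValueError(policy)

writes = [(0, 7), (3, 9), (1, 5)]
print(resolve(writes, "priority"))    # 9
print(resolve(writes, "arbitrary"))   # one of 7, 9, 5
```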

Advantages & Limitations

  • in parallel or distributed computation there is no universally accepted high-level programming paradigm
  • at abstract level PRAM model comes close to satisfying the 3 conditions that made RAM successful:
    1. high level language for expressing complex algorithms
    2. machine architecture (von Neumann) that can be implemented efficiently
    3. efficient compilation to translate programs into machine code
  • implementing PRAM is not straightforward
    • some efficient simulations of PRAM exist, e.g. on a hypercube
    • sometimes mapped to specific target architecture
  • improvements in PRAM models often result in improvements in practical architectures

Measuring Parallelism & Efficiency

Note: more notes on this same topic here

  • \(T_1\) - time on single sequential processors
  • \(P\) - number of processors
  • \(T_p\) - time on \(P\) processors
  • \(T^*_1(N)\) - running time of the most efficient sequential algorithm
Formula                                            Description
\(W = P \cdot T_p(N)\)                             work law
\(W = 1 \cdot T_1(N) = T_1(N)\)                    work law when \(P=1\)
\(W = P \cdot T_p(N) = \Theta(T^*_1(N))\)          optimal parallel algo
\(W = \Theta(T^*_1(N) \cdot \log^{O(1)}N)\)        efficient parallel algo
\(S = T^*_1(N) / T_p(N)\)                          speedup
\(S = \Theta(P)\)                                  linear speedup (optimal)
\(S = \Theta(P/\log^{c}N)\)                        speedup of efficient algo
\(S = \Theta(P/N^{\varepsilon})\)                  polynomially efficient speedup
  • Always measure speedup \(S\) relative to the most efficient sequential algorithm
  • a parallel algorithm is optimal when the parallel work on an input of size \(N\) is within a multiplicative constant of \(T^*_1(N)\) → achieved only with speedup linear in \(P\)
  • Linear speedup requires a setting where processors operate in synchrony
  • Efficient parallel algorithm has speedup that is near linear in \(P\)
  • For polynomially efficient algorithm: \(0 < \varepsilon < 1\)
  • Using work as the main complexity measure is stricter than parallel time alone; an algorithm is work-optimal if
    \(W(N, P) = P \cdot T_p(N) = O(W_{LB}(N, P))\), where \(W_{LB}(N, P)\) is the lower bound on the work required
  • Some tasks that are already doable as efficient or optimal parallel algorithms:
    • adding \(N\) integers (see the worked sketch at the end of this section)
    • sorting \(N\) records
    • maintaining lists or trees of size \(N\)
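
A worked sketch of the measures above for adding \(N\) integers: each processor first sums a block of roughly \(N/P\) inputs, then the \(P\) partial sums are combined by a binary tree. It assumes \(T^*_1(N) = N - 1\) additions; the function name and the chosen values of \(P\) are illustrative only.

```python
import math

def parallel_sum_cost(N, P):
    # Phase 1: each processor sequentially sums a block of ~N/P inputs.
    local_time = math.ceil(N / P) - 1          # additions per processor
    # Phase 2: combine the P partial sums by a binary tree, one level per round.
    tree_time = math.ceil(math.log2(P))        # parallel rounds
    T_p = local_time + tree_time               # parallel time T_p(N)
    work = P * T_p                             # work law: W = P * T_p(N)
    speedup = (N - 1) / T_p                    # S = T*_1(N) / T_p(N)
    return T_p, work, speedup

N = 1 << 20
for P in (16, 1024, N // 20):                  # last case: P ~ N / log N
    T_p, work, speedup = parallel_sum_cost(N, P)
    print(f"P={P:>6}  T_p={T_p:>6}  W={work:>8}  S={speedup:8.1f}")
# With P = Theta(N / log N) the work stays Theta(N) and the speedup is Theta(P),
# so the algorithm is efficient/work-optimal in the sense defined above.
```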