
Systems & Models

  • distributed system - collection of individual computing devices that can communicate with each other → each processor has semi-independent agenda
  • Difficult to construct due to: 1. asynchrony 2. limited local knowledge 3. failures

Message Passing Systems

  • processors communicate by sending messages over a bidirectional communication channel
  • this pattern of connections is called topology or network → represented as an undirected graph where
    • processors represent vertices and
    • connections are edges (when connection exists)
  • An algorithm for such a system is a local program for each processor
  • configuration vector represents the state of each processor in the system, such that C = (q_0, q_1, ..., q_{n-1}) contains state q_i of processor p_i
  • events (or actions) are operations that happen in the system, e.g. local computation or message delivery
  • execution models the behavior of the system over time as a sequence of configurations and events → can be finite or infinite → must satisfy conditions used to represent correctness properties of the model (a sketch of this vocabulary follows this list)
  • Categories of conditions:
    • safety - must hold in every finite prefix of any execution
    • liveness - must eventually hold a certain number of times (possibly infinitely often) → liveness includes the fairness condition: an infinite execution contains infinitely many actions by each processor.
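
A minimal sketch of the configuration/event/execution vocabulary above, assuming a toy system with only two event kinds (local computation and message delivery); Configuration and apply_event are illustrative names, not part of any standard API.

```python
from dataclasses import dataclass, field

@dataclass
class Configuration:
    states: list                                   # states[i] = q_i, the state of processor p_i
    in_transit: set = field(default_factory=set)   # messages sent but not yet delivered

def apply_event(config, event):
    """Return the configuration that follows `config` after a single event."""
    kind, pid, payload = event
    states = list(config.states)
    msgs = set(config.in_transit)
    if kind == "compute":                          # local computation step at processor pid
        states[pid] = (states[pid], "stepped")
    elif kind == "deliver":                        # delivery of a message addressed to pid
        msgs.discard(payload)
        states[pid] = (states[pid], payload)
    return Configuration(states, msgs)

# An execution is an alternating sequence C_0, e_1, C_1, e_2, C_2, ...
# Safety conditions must hold in every finite prefix; liveness conditions
# (e.g. "every message is eventually delivered") must hold eventually.
C0 = Configuration(states=["init"] * 3, in_transit={("hello", 1)})
C1 = apply_event(C0, ("deliver", 1, ("hello", 1)))
C2 = apply_event(C1, ("compute", 0, None))
```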

Asynchronous Systems

  • when there exists no upper bound on how long it takes for a message to be delivered or how much time elapses between processor steps, e.g. the Internet
  • admissible execution - each processor has an infinite number of computation events and every message is eventually delivered → assumes processors do not fail → there may be many executions of the same algorithm
  • terminated state
    • each processor's state includes a set of such states
    • once processor enters such state it remains in it
    • algorithm terminates once all processors are in a terminated state
  • message complexity - maximum of total number of messages sent over all admissible executions
  • timed execution - associates a real-valued time with each event, indicating when the event occurs
  • measuring time complexity (a sketch follows this list):
    • assume that max. message delay is 1 unit of time
    • complexity is the maximum time until termination of all timed, admissible executions with each message having delay at most 1
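
A small sketch of this time measure, assuming a hypothetical token-forwarding algorithm on a line of n processors (async_broadcast_time is an illustrative name): every message delay is normalized to at most 1, so the worst admissible timed execution terminates at time n - 1.

```python
def async_broadcast_time(n, delays):
    """Time until the last processor terminates when p_0 forwards a token along a line.

    delays[i] is the delay of the message from p_i to p_(i+1); admissibility is
    modeled by requiring every delay to be positive and at most 1 time unit.
    """
    assert len(delays) == n - 1 and all(0 < d <= 1 for d in delays)
    return sum(delays)   # the token reaches p_(n-1) only after every hop completes

print(async_broadcast_time(5, [1, 1, 1, 1]))        # worst case: 4 = n - 1 time units
print(async_broadcast_time(5, [0.2, 1, 0.5, 0.3]))  # a faster execution of the same algorithm
```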

Synchronous System

  • processors can send messages to neighbors and execution progresses in rounds
  • admissible execution - one that is infinite → every processor takes an infinite number of computation steps and every message is eventually delivered (no failures)
  • Unlike an async system, once the algorithm is fixed the execution is determined entirely by the initial configuration → only the initial configuration may change
  • Same definition of terminated states applies as asynchronous systems
  • Message complexity is the maximum number of total messages sent over all admissible executions
  • To measure time complexity, count the number of rounds in any admissible execution until termination (see the round-loop sketch below)
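
A sketch of the round structure above (run_synchronous, step, and the graph/inbox representation are illustrative choices): in each round every processor consumes the messages sent to it in the previous round, computes, and sends to its neighbours; the returned counts correspond to the time and message complexity of that execution.

```python
def run_synchronous(graph, states, step, max_rounds=1000):
    """graph[v] = neighbours of v; step(v, state, inbox) -> (new_state, outbox, done),
    where outbox is a list of (neighbour, message) pairs sent by v this round."""
    rounds, total_msgs = 0, 0
    inboxes = {v: [] for v in graph}
    done = {v: False for v in graph}
    while not all(done.values()) and rounds < max_rounds:
        outboxes = {}
        for v in graph:                              # every processor takes a step per round
            states[v], out, done[v] = step(v, states[v], inboxes[v])
            outboxes[v] = out
        inboxes = {v: [] for v in graph}
        for v, out in outboxes.items():              # messages arrive at the start of the next round
            for u, msg in out:
                inboxes[u].append((v, msg))
                total_msgs += 1
        rounds += 1
    return rounds, total_msgs                        # time complexity, message complexity
```

Plugging a flooding broadcast into step, for example, informs every processor after a number of rounds equal to the distance from the source to the farthest node, which is how counting rounds yields time complexity.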

Parallel Models

Multiple cooperating processors increase efficiency by:

  • increasing system throughput by performing tasks in parallel → systems for this are available as hardware/OS software
  • reducing response time by partitioning tasks between processors → realizing this is a challenge

→ speeding up a computation requires that faster-than-sequential algorithms exist

PRAM

  • PRAM = Parallel Random Access Machine → set of synchronous processors with concurrent access to shared memory
  • Common features:
    • there are P processors each with unique PID
    • each processor has access to its PID and number of processors
    • shared memory is accessible to all processors and each access takes 1 unit of time
    • shared memory has Q cells and input of size N ≤ Q is stored in the first N cells
    • processors have access to input size N
    • memory cells >N are cleared and contain zeros
    • each processor has local, private memory
    • all memory cells can store up to Θ(lg max{N, P}) bits
    • usually Q is polynomial in P and N, and private memory is small
  • Processor instructions are defined in terms of 3 synchronous cycles:
    • read cycle - read from shared to private memory
    • compute cycle - performs computation using data in local memory
    • write cycle - write from local memory to shared memory
  • assume a fault-free execution environment (a parallel-sum sketch using these cycles follows this list)
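
A minimal sketch of how the three cycles compose into an algorithm, assuming the classic parallel-sum example (pram_sum and the list-as-shared-memory representation are illustrative): N inputs are combined pairwise, so about N/2 processors finish in Θ(lg N) steps, each step split into read, compute, and write cycles.

```python
def pram_sum(shared, n):
    """Sum of the inputs stored in shared[0..n-1], simulating synchronous PRAM steps."""
    stride = 1
    while stride < n:
        local = {}                                    # private memories for this step
        for pid in range(n // (2 * stride) + 1):      # all active processors, in lockstep
            i, j = 2 * stride * pid, 2 * stride * pid + stride
            if j < n:
                local[pid] = shared[i] + shared[j]    # read cycle + compute cycle
        for pid in local:
            shared[2 * stride * pid] = local[pid]     # write cycle
        stride *= 2
    return shared[0]

mem = [1, 2, 3, 4, 5, 6, 7, 8] + [0] * 8              # Q = 16 cells, input in the first N = 8
print(pram_sum(mem, 8))                               # 36
```

Every read and write in this sketch touches a distinct cell, so it already fits the weakest (EREW) variant described next.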

Variants

  • EREW - exclusive read exclusive write - no concurrent access to same memory location
  • CREW - concurrent read, exclusive write - arbitrary concurrent reads from same memory location are allowed
  • CRCW - concurrent read, concurrent write - there are 3 write policies (a sketch of these policies follows this list):
    1. Common CRCW - all concurrent writes must write the same value
    2. Arbitrary CRCW - the value to be written is chosen arbitrarily
    3. Priority CRCW - processor with the highest PID succeeds
  • EREW is the weakest and Priority CRCW is the strongest → an algorithm that works on a weaker variant works on a stronger variant → stronger variants can be simulated on weaker variants at modest cost
  • IDEAL PRAM - sometimes used to show lower bounds without failures:
    • has no bounds on memory size and a processor can perform an arbitrary number of computations in unit time
    • also classified as EREW, CREW, CRCW IDEAL PRAM
  • Memory snapshot model
    • alternative model for showing lower bounds for fault-prone models
    • processors can read and process entire shared memory at unit cost
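
A small sketch of how the three write policies resolve one conflicting cycle (resolve_crcw is an illustrative name): given the (PID, value) pairs attempted on a single cell in the same write cycle, each policy picks the value that survives.

```python
def resolve_crcw(writes, policy):
    """writes: list of (pid, value) pairs attempted on one cell in the same write cycle."""
    if not writes:
        return None
    if policy == "common":
        values = {v for _, v in writes}
        assert len(values) == 1, "Common CRCW requires all concurrent writes to agree"
        return values.pop()
    if policy == "arbitrary":
        return writes[0][1]          # any one of the attempted values may win
    if policy == "priority":
        return max(writes)[1]        # the processor with the highest PID succeeds
    raise ValueError(policy)

writes = [(3, 7), (5, 9), (1, 7)]
print(resolve_crcw(writes, "priority"))            # 9, written by PID 5
print(resolve_crcw([(3, 7), (1, 7)], "common"))    # 7, all writers agree
```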

Advantages & Limitations

  • in parallel or distributed computation: no universally accepted high-level programming paradigm
  • at an abstract level the PRAM model comes close to satisfying the 3 conditions that made RAM successful:
    1. high level language for expressing complex algorithms
    2. machine architecture (von Neumann) that can be implemented efficiently
    3. efficient compilation to translate programs into machine code
  • implementing PRAM is not straightforward
    • some efficient simulations of PRAM exist, e.g. hypercube
    • sometimes mapped to specific target architecture
  • improvements in PRAM models often result in improvements in practical architectures

Measuring Parallelism & Efficiency

Note: more notes on this same topic here

  • T_1 - time on a single sequential processor
  • P - number of processors
  • T_P - time on P processors
  • T_1(N) - running time of the most efficient sequential algorithm on input of size N
  • W = P · T_P(N) - work law
  • W = 1 · T_1(N) = T_1(N) - work law when P = 1
  • W = P · T_P(N) = Θ(T_1(N)) - optimal parallel algorithm
  • W = Θ(T_1(N) · log^O(1) N) - efficient parallel algorithm
  • S = T_1(N) / T_P(N) - speedup
  • S = Θ(P) - linear speedup (optimal)
  • S = Θ(P / log^c N) - speedup of an efficient algorithm
  • S = Θ(P / N^ε) - polynomially efficient speedup
  • Always measure speedup S relative to the most efficient sequential algorithm
  • a parallel algorithm is optimal when the parallel work on input N is within a multiplicative constant of T_1(N) → achieved only with linear speedup in P
  • For linear speedup need setting where processors operate in synchrony
  • Efficient parallel algorithm has speedup that is near linear in P
  • For a polynomially efficient algorithm: 0 < ε < 1
  • Using work as the main complexity measure is stricter than parallel time alone; an algorithm is work-optimal if
    W(N, P) = P · T_P(N) = O(W_LB(N, P))
    where W_LB(N, P) is the lower bound on the work required to solve the problem
  • Some tasks that are already doable as efficient or optimal parallel algorithms:
    • adding N integers (see the worked example after this list)
    • sorting N records
    • maintaining lists or trees of size N
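
A worked example tying the table above to the first of these tasks, assuming the pairwise parallel-sum approach sketched earlier (work_and_speedup is an illustrative helper): adding N integers with P processors takes roughly T_P(N) = N/P + lg P steps versus T_1(N) = N sequentially, so the work stays Θ(T_1(N)) and the speedup is near-linear in P.

```python
from math import log2

def work_and_speedup(N, P):
    T1 = N                       # best sequential time: N - 1 additions, Theta(N)
    TP = N / P + log2(P)         # each processor adds N/P items, then a lg P combining tree
    return P * TP, T1 / TP       # work W = P * T_P(N), speedup S = T_1(N) / T_P(N)

W, S = work_and_speedup(N=1_000_000, P=64)
print(round(W), round(S, 2))     # W ≈ 1,000,384 = Theta(N) (work-optimal), S ≈ 63.98 ≈ P
```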