Skip to content

23. Architectures

Mar 31, 2021

Hypercube Simulations

  • \(L = \Theta(\lg P)\)
  • \(T_{PRAM} = \Theta(\lg P)\)
  • \(W_{PRAM} = \Theta(P \cdot \lg P)\)
  • \(T_{Simulation} = \Theta(\lg^{2} P)\)
  • \(W_{Simulation} = \Theta(P \cdot \lg^{2} P)\)

Associative operation on a hypercube

Consider an associative operation e.g. \(\sum x\) on a PRAM: oblivious simulation degrades by \(\log P\) factor.

There is better an oblivious simulation for this:

  • send own value to specific dimension neighbor
  • neighbor adds to own value

img

  • Step 1: send left to right, send in increasing order:

    • 011111: \(X_3 + X_7\)
    • 001101: \(X_1 + X_5\)
    • 010110: \(X_2 + X_6\)
    • 000100: \(X_0 + X_4\)
  • Step 2 repeat operation, but now in y dimension

    • those nodes that already contributed their value do not need to participate in the computation
    • 100110: \(X_0 + X_4 + X_2 + X_6\)
    • 101111: \(X_1 + X_5 + X_3 + X_7\)
  • Step 3 only two nodes participate and add values in z-dimension

    • result 110111: \(X_0 + X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + X_7\)

The efficiency of this simulation is \(\lg P\), which is the same as the original algorithm → there is no loss in efficiency in this simulation.

Broadcast on a hypercube

Broadcast is another operation that is good on a hypercube

  • within a hypercube, there exists an embedded tree structure
  • there are many ways to use and interpret the architecture, here are two
    1. leaves are processors, the rest are switches
    2. all are processors
  • hypercube allows simulating log-tree algorithms without overhead
  • the weak point of this strategy is that root can become a bottleneck if all left nodes want to concurrently communicate with right side nodes → address such bottleneck by using a fat tree

Fat Tree

"fat tree"

  • make the fat tree strong node (center) the tree root
  • less powerful nodes on sides with strong connecting wires
  • nodes weaken over distance
  • nodes can be understood as switches
  • this architecture can be packed tightly on a plane and reduces the impact of bottleneck root, while preserving the tree structure
  • it is easy to build and can be layered vertically also

Tree-connected grid / grid-connected trees

Consider 2D grid architecture:

img

we are trying to look at each row and "plant a tree" over this row → plant a new tree over every row and column:

img

Message path is from node A to node B:

  • if message path < \(\log\): send along tree
  • if message path > \(\log\): send along the grid, do not overload the root.
  • \(> \log\) implies we have to cross the root
  • dot he same across every row and column

If there is not too much traffic, can reduce latency from \(P^2\) to \(\log P\)

  • in worst case may not help
  • amortized cost may be reasonable

Hardware cost: \(HW \$ = \Theta(P)\) + wires of the grid

Number of nodes: \(2 \cdot \sqrt(P) - 1 = \Theta(P)\)

Hardware cost of grid tree:

\(\Theta(P) + \#cols \cdot tree + \#rows \cdot tree = \Theta(P) + \sqrt{P} (\Theta(P)) + \sqrt{P} (\Theta(P)) = \Theta(P)\)

How to make sure grid behaves?

  • roll into cylinder → longest path \(\sqrt{P} - 1\)
  • shape like a bagel → max path cut into half \(\sqrt{P} / 2\)
    • with the bagel shape all path distances are cut in half in \(x\) and \(y\) dimension

img

Multi-connection - Interconnection Network

  • first introduced in telephony
  • no contention → if wire is already in use set busy signal
  • if not busy, allow routing any permutation A - B
  • this signal has inspired icons for switches

img

Omega/Shuffle Exchange

  • Imagine the way of shuffling 8 cards → split into two then interleave them
  • claim: will reach the destination → all paths 0...8 can reach 0...8 outputs
  • within this tree is an embedded tree structure
  • in \(\log\) time, can go from any input to any output
    • 0: go north
    • 1: go south

img

Can interpret the omega network/shuffle exchange as processors → memory

  • the architecture works in both directions
  • hardware cost \(HW \$ = \Theta(P) + \# \text{stages} \cdot \text{cost of stage} = \Theta(P) + \log_{2}P \cdot \frac{P}{2} = \Theta(P \lg P)\)
  • if processors are expensive and switches are cheap, then this architecture may be worthwhile to use
  • all paths have the same length → uniform min (uniform multi-connection network)

This architecture can perform summation task:

  • sum all leaves → broadcast op-code to all leaves
  • when memory gets the message, return memory value in opposite direction
  • when switch receives a message, it sums the inputs
  • requester receives the result in \(2 \cdot \log P\) time

It can do any associative operation this way at no delay relative to the PRAM because it is tailored to this architecture

Simulation using shuffle exchange:

  • \(T_{PRAM} = \Theta(\lg P)\)
  • \(W_{PRAM} = P \cdot \Theta(\lg P) = \Theta(P \lg P)\)
  • \(T_{sim} = \Theta(\lg P)\)
  • \(W_{sim} = HW \$ \cdot \Theta(\lg P) = \Theta(P \lg^2 P)\)