23. Architectures

Mar 31, 2021

Hypercube Simulations

Consider an associative operation, e.g. $\sum x$, on a PRAM: an oblivious simulation on the hypercube degrades it by a $\log P$ factor.

There is a better simulation than the oblivious one:

Step 1: send left to right, send in increasing order:
- 011 → 111: $X_3 + X_7$
- 001 → 101: $X_1 + X_5$
- 010 → 110: $X_2 + X_6$
- 000 → 100: $X_0 + X_4$
Step 2: repeat the operation, but now in the y dimension
- nodes that have already contributed their value no longer need to participate in the computation
- 100 → 110: $X_0 + X_4 + X_2 + X_6$
- 101 → 111: $X_1 + X_5 + X_3 + X_7$
Step 3: only two nodes participate and add their values in the z dimension
- result 110 → 111: $X_0 + X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + X_7$

This simulation takes $\lg P$ steps, the same as the original PRAM algorithm → there is no loss in efficiency in this simulation.
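The three steps above can be sketched as a small Python simulation. This is a minimal sketch, not the lecture's own code: the function name `hypercube_sum` and the choice to process the highest address bit first (matching 011 → 111 in step 1) are assumptions made for illustration.

```python
def hypercube_sum(values):
    """Simulate the dimension-by-dimension sum on a hypercube of P = len(values) nodes.

    Returns (total, node_holding_total, number_of_steps).
    """
    p = len(values)
    d = p.bit_length() - 1
    assert 1 << d == p, "P must be a power of two"

    acc = list(values)        # acc[n] is the partial sum currently held by node n
    active = set(range(p))    # nodes that have not yet handed off their value
    steps = 0

    # Highest bit first, so step 1 pairs 0xx with 1xx as in the notes.
    for dim in reversed(range(d)):
        bit = 1 << dim
        senders = [n for n in active if not n & bit]
        for src in senders:
            dst = src | bit           # neighbor across this dimension
            acc[dst] += acc[src]
            active.discard(src)       # src is done after contributing
        steps += 1

    (root,) = active
    return acc[root], root, steps
```

With `values = [0, 1, ..., 7]` the total ends up at node 7 (binary 111) after 3 steps, which is $\lg 8$.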

Broadcast is another operation that maps well onto a hypercube
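A broadcast can be sketched as the reduction run in reverse: in each dimension, every node that already holds the value forwards it across that dimension. The sketch below is illustrative only; the name `hypercube_broadcast` and the lowest-bit-first order are assumptions, not something the notes specify.

```python
def hypercube_broadcast(p, root=0):
    """Return (schedule, informed) for a broadcast from `root` on a P-node hypercube.

    schedule[k] lists the (sender, receiver) pairs used in step k.
    """
    d = p.bit_length() - 1
    assert 1 << d == p, "P must be a power of two"

    informed = {root}
    schedule = []
    for dim in range(d):
        bit = 1 << dim
        # Every informed node sends to its neighbor across dimension `dim`.
        sends = [(n, n ^ bit) for n in sorted(informed)]
        informed |= {dst for _, dst in sends}
        schedule.append(sends)
    return schedule, informed
```

The number of informed nodes doubles each step, so all $P$ nodes are reached in $\lg P$ steps.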

within a hypercube, there exists an embedded tree structure
there are many ways to use and interpret the architecture, here are two
1. leaves are processors, the rest are switches
2. all are processors
hypercube allows simulating log-tree algorithms without overhead
the weak point of this strategy is that the root can become a bottleneck if all nodes on the left side want to communicate concurrently with nodes on the right side → address this bottleneck by using a fat tree

"fat tree"

make the strongest node (center) the root of the fat tree
less powerful nodes sit on the sides, attached by strong connecting wires
nodes weaken with distance from the root
nodes can be understood as switches
this architecture can be packed tightly on a plane and reduces the impact of the root bottleneck, while preserving the tree structure
it is easy to build and can be layered vertically also

Consider 2D grid architecture:

look at each row and "plant a tree" over it → plant a new tree over every row and every column:

Message path is from node A to node B:

If there is not too much traffic, this can reduce latency from $\Theta(\sqrt{P})$ (the diameter of a $\sqrt{P} \times \sqrt{P}$ grid) to $\Theta(\log P)$
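To make the latency comparison concrete, here is a rough hop-count sketch. It assumes a $\sqrt{P} \times \sqrt{P}$ grid whose side is a power of two, complete binary trees over each row and column, and a message that uses the row tree first and the column tree second; the function names are illustrative and the routing order is an assumption.

```python
def mesh_hops(a, b):
    """Hops between grid nodes a=(row, col) and b=(row, col) using grid wires only."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def tree_hops(i, j):
    """Hops between leaves i and j of a complete binary tree:
    up to the lowest common ancestor and back down."""
    return 2 * (i ^ j).bit_length()


def grid_tree_hops(a, b):
    """Row tree from a's column to b's column, then column tree from a's row to b's row."""
    return tree_hops(a[1], b[1]) + tree_hops(a[0], b[0])
```

On a 32 x 32 grid ($P = 1024$), corner to corner costs `mesh_hops((0, 0), (31, 31)) == 62` over grid wires but `grid_tree_hops((0, 0), (31, 31)) == 20`, which is $2 \lg P$. Nearby nodes can still be better served by the grid wires: columns 15 and 16 in the same row are 1 hop apart on the grid but 10 hops apart through the row tree.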

Hardware cost: $HW \$ = \Theta(P)$ + wires of the grid

Number of nodes in one tree (planted over a row or column of $\sqrt{P}$ leaves): $2 \cdot \sqrt{P} - 1 = \Theta(\sqrt{P})$

Hardware cost of grid tree:

$\Theta(P) + \#\text{cols} \cdot \text{tree} + \#\text{rows} \cdot \text{tree} = \Theta(P) + \sqrt{P} \cdot \Theta(\sqrt{P}) + \sqrt{P} \cdot \Theta(\sqrt{P}) = \Theta(P)$
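The node count can be checked numerically. The sketch below assumes $P$ is a perfect square and counts $2\sqrt{P} - 1$ nodes per tree (leaves included, following the count used in these notes); `grid_tree_cost` is a hypothetical helper name.

```python
import math


def grid_tree_cost(p):
    """Total node count: P grid processors plus one tree per row and per column."""
    side = math.isqrt(p)
    assert side * side == p, "P must be a perfect square"
    tree_nodes = 2 * side - 1          # nodes in one tree over `side` leaves
    return p + side * tree_nodes + side * tree_nodes


if __name__ == "__main__":
    for p in (16, 256, 1024, 4096):
        total = grid_tree_cost(p)
        print(p, total, total / p)     # the ratio stays below 5
```

The total is $P + 2\sqrt{P}(2\sqrt{P} - 1) = 5P - 2\sqrt{P}$, so the cost per processor is bounded by a constant.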

How to make sure grid behaves?

roll into a cylinder → longest path $\sqrt{P} - 1$
shape like a bagel (torus) → max path cut in half, to $\sqrt{P} / 2$
- with the bagel shape all path distances are cut in half in both the $x$ and $y$ dimensions
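The halving can be verified by brute force along a single dimension. This is a small sketch under the assumption that a dimension of the grid has $n = \sqrt{P}$ nodes and that wrapping adds one wire between the two ends; `max_distance_1d` is an illustrative name.

```python
def max_distance_1d(n, wrap):
    """Largest shortest-path distance between any two of n nodes in a line
    (wrap=False) or in a ring (wrap=True)."""
    worst = 0
    for a in range(n):
        for b in range(n):
            d = abs(a - b)
            if wrap:
                d = min(d, n - d)      # may go around the other way
            worst = max(worst, d)
    return worst
```

For `n = 8`, the line gives 7 ($n - 1$) and the ring gives 4 ($n / 2$). A cylinder wraps one dimension and a torus wraps both.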

Can interpret the omega network/shuffle exchange as processors → memory

the architecture works in both directions
hardware cost $HW \$ = \Theta(P) + \# \text{stages} \cdot \text{cost of stage} = \Theta(P) + \log_{2}P \cdot \frac{P}{2} = \Theta(P \lg P)$
if processors are expensive and switches are cheap, then this architecture may be worthwhile to use
all paths have the same length → uniform MIN (multistage interconnection network)
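The uniform path length and the switch count can be illustrated with destination-tag routing, a standard way to route through an omega network. The sketch assumes $P$ is a power of two and that each stage is a perfect shuffle followed by a column of $P/2$ two-by-two switches; the function names are illustrative.

```python
def omega_route(src, dst, p):
    """Return the line positions visited when routing src -> dst through
    a P-input omega network (one entry per stage boundary)."""
    n_bits = p.bit_length() - 1
    assert 1 << n_bits == p, "P must be a power of two"
    assert 0 <= src < p and 0 <= dst < p

    mask = p - 1
    pos = src
    path = [pos]
    for stage in range(n_bits):
        # Perfect shuffle: rotate the address left by one bit.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # Exchange: the switch output is chosen by the next destination bit (MSB first).
        dest_bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | dest_bit
        path.append(pos)
    return path


def omega_switch_count(p):
    """Number of 2x2 switches: log2(P) stages of P/2 switches each."""
    n_bits = p.bit_length() - 1
    return n_bits * (p // 2)
```

Every route passes through exactly $\log_2 P$ stages regardless of source and destination, and `omega_switch_count(8)` is 12, matching $\log_2 P \cdot \frac{P}{2}$.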

This architecture can perform summation task:

It can perform any associative operation this way with no delay relative to the PRAM, because the computation is tailored to this architecture
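One way such a summation might look is sketched below. It assumes, purely for illustration, that each two-by-two switch adds its two inputs and forwards the sum on both outputs; the notes do not specify the switch behavior, and the names `shuffle` and `shuffle_exchange_sum` are hypothetical.

```python
def shuffle(i, n_bits):
    """Perfect shuffle of an n_bits-wide address: rotate left by one bit."""
    mask = (1 << n_bits) - 1
    return ((i << 1) | (i >> (n_bits - 1))) & mask


def shuffle_exchange_sum(values):
    """Run log2(P) shuffle + combine stages; return the value on every output line."""
    p = len(values)
    n_bits = p.bit_length() - 1
    assert 1 << n_bits == p, "P must be a power of two"

    cur = list(values)
    for _ in range(n_bits):
        nxt = [None] * p
        for i, v in enumerate(cur):
            nxt[shuffle(i, n_bits)] = v          # shuffle wiring
        for k in range(0, p, 2):
            s = nxt[k] + nxt[k + 1]              # switch combines its two inputs
            nxt[k] = nxt[k + 1] = s              # and forwards the sum on both outputs
        cur = nxt
    return cur
```

Under this assumption, after $\log_2 P$ stages every output line holds the full sum, so the result takes the same number of steps as the PRAM tree sum. Replacing `+` with any other associative operator works the same way.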

Simulation using shuffle exchange: