Skip to content

7. Chunking & Prefix

Feb 3, 2021

Padding

  • assume \(N\) where \(2^{k-1} < N < 2^k\) (\(N\) is not a power of 2) and \(P = 2^k\)
  • then \(2^{k-1} < N < N^{'} < 2^k\) and \(k-1 < \lg N^{'} < k\)
  • time: \(T_p = \lg N^{'} = k = \Theta(\lg N)\)
  • work: \(W_p = P \cdot \lg N^{'} = \Theta(N \lg N)\)
  • for an exact power we have:
    • \(T_p = \Theta(\lg N)\) and
    • \(W_p = \Theta(N \lg N)\)
  • padding to an exact power does not make a difference to complexity

Chunking

  • Consider an associative operation where \(P=N\) → only 1/2 of processors need on each height → high slackness

img

  • Next consider a case where \(P \ll N\) (the usual case)
  • Perform some associative operation on \(N\) then:
    • \(W = \Theta(N \lg N) > \Theta(N) = \Theta(N) = T_1^{*}(N)\)
    • when \(P=1\) optimal, not fast
    • when \(P=N\) fast, not optimal
  • How to be optimal and fast? Give each processor a chunk of \(N/P\) input

Phase I

  • Compute partial sums in leaves
  • Time of 1 chunk = \(\Theta(N/P) = T^{(1)}\)
  • Work of 1 chunk = \(\Theta(N) = W^{(1)}\)
  • So far optimal → now we have \(P\) partial sums in leaves

Phase II

  • Add the chunks
  • \(T^{(2)} = \Theta(\lg P)\)
  • \(W^{(2)} = P \cdot \Theta(\lg P) = \Theta(P \lg P)\)

  • total time: \(T^{(1)} + T^{(2)} = \Theta(\frac{N}{P} + \lg P)\)
  • total work: \(W^{(1)} + W^{(2)} = \Theta(N + P \lg P)\)
  • when is \(P \lg P\) optimal? (tree height \(\lt N\))
\[ P \lg P \leq N \\ P \leq \frac{N}{\lg P} \\ P \leq \frac{N}{\lg N} \leq \frac{N}{\lg P} \\ \]
  • solve for \(P = N / \lg N\):
    • \(1 \leq P \leq N/ \lg N\)
    • under these conditions algorithm is optimal
\[ T_{P,N} = \Theta(\frac{N}{N \lg N} + \lg(\frac{N}{\lg N})) = \Theta(\lg N + (\lg N - \lg \lg N)) \\ \]
\[ = \Theta(2 \lg N - \lg \lg N) \\ = \Theta(\lg N) \]
\[ W_{P, N} = P \cdot \Theta(\lg N) \\ = \frac{N}{\lg N} \cdot \Theta(\lg N) \\ = \Theta(N) \]
  • Time \(P = \frac{N}{\lg N} \leq \lg N + \lg N = 2 \lg N\)

First strategy to convert a non-optimal algorithm to optimal, is to try chunking

Prefix Computation

  • Challenge: compute the sum of \(X[1...N]\) array, such that:
    • \(S_1 = X[1]\)
    • \(S_2 = X[1] + X[2] = S_1 + X[2]\)
    • \(S_3 = X[1] + X[2] + X[3] = S_2 + X[3]\)
    • ...
    • \(S_k = \sum_{i}^{k}X[i]\)
    • ...
    • \(S_n = \sum_{i}^{n}X[i]\)
  • prefix sum includes self and everything to the left

img

Uniprocessor implementation

1
2
3
S[1] = X[1]
for 2...N
    S[i] = S[i - 1] + X[i]
  • Time (precise analysis): \(T_1 = 1 + (N-1) \cdot 1 = N\)

Parallel Prefix

  • similar to processor enumeration
  • almost all processors work → not similar slackness as with addition
  • \(T_p = \Theta(\log N)\)
  • \(W = P \cdot T_P = \Theta(N \lg N) > T_1^{*}(N)\)
  • Not optimal, but log efficient

Strategy 1: do in 2 passes 1. bottom-to-top summation 2. top-to-bottom add left child

Strategy 2: try chunking - \(P = 1\) slow and optimal, \(P = N\) fast and not optimal → choose chunk size \(\lg N\)

steps: 1. sum chunks: \(T_{(1)} = N/P\) (\(< \lg N\)) 2. compute prefix sum (\(z_{1-4}\)): \(T_{(2)} = \lg P\) (\(< \lg N\)) 3. distribute: \(T_{(3)} = N/P\) (\(< \lg N\))

steps combined: \(T_{(1)} + T_{(2)} + T_{(3)} = 3 \lg N\)

  • Total time: \(T_{total} = N/P + \lg P + N/P = \Theta(N/P + \lg P)\)
  • Work: \(W = \Theta(P \cdot (N/P) + P \lg P) = \Theta(N + P \lg P)\)
  • Optimal \(P\) when \(P = N / \lg N\)

Example: Prefix with Chunks

img

For example, \(P_2\) computes: 1. \(S_4 = X_4 + Z_1\) 2. \(S_5 = X_5 + S_4\) 3. \(S_6 = X_6 + S_5\)

  • to compute \(X_4 = X_4 + temp_1\)
  • to compute \(X_7 = X_7 + temp_1 + temp_2\)
  • before these can be computed, need to know the value to the left