9. Failure Models
Feb 10, 2021
Failure Models
Recall PRAM model: synchronized, high bandwidth, no failures. Now remove assumption of no failures and assume there is some adversary working against us.
Failure type | Description |
---|---|
init | processor fails to start |
crash | processor stops, doesn't do anything else after that |
crash & restart | processor crashes but can restart |
byzantine | adversary can cause any arbitrary behavior |
There are more failure models: omission failure, transient failure, software failure, security failure, temporal failure, environmental perturbations, but omission failure occurred while taking notes.
Failure model - what type of crashes occur during computation - is adversary online/offline: - online adversary can cause dynamic failures during computation - offline adversary must choose failure model before computation starts - number of failures \(f < P\) → some processor always survives - adversary is omniscient
Example: Enumeration with dynamic failures
- processors zero memory before use, initially 3 failures occur
- 2nd pass: processors 2, 3 die
- 3rd pass: processors 5,7 die after clearing & writing
- final result overestimates number of available processors (real number: 1)
If there are dynamic failures during this computation, the result will be an overestimate
Overestimate is not bad; underestimate is bad. Why?
Consider \(N=1000\) and number of tasks 100. In both cases processor estimate is \(P = 10\):
- \(P\) is overestimate (actual \(P=2\)): \(W = 100 \times 2 = 200\)
(less work done than expected)
- \(P\) is underestimate (actual \(P=100\)): \(W = 100 \times 100 = 10,000\)
(more work done than expected)
With overestimate you can reallocate remaining work if all tasks are not done, but will not end up doing unnecessary work. Enumerate processors and reallocate work.
Work with Failures
- in a non-failure environment, work \(W\) is the area of \(P \cdot T_P\)
- with failures time to completion increases with failures
- with failures \(P \cdot T_P\) no longer a good way to estimate \(W\)
- commute the area under the curve into a step function to determine work: \(W = \sum_{1}^{T}(P_T)\) where \(P_t\) is the number of active processors at time \(t\).
Load Balancing Progress Tree
Given \(N\) tasks and \(P\) processors and \(N = P\), processors write 1 to the leaf when they are done with their task
- Perform bottom-up simple addition:
- everywhere where 1: there is a processor
- everywhere where 0: processor failure
- Result: 5 tasks completed and 3 tasks remain
- Load balance the remaining tasks
Next: assume more processors fails after the first pass (tasks are done but processors are dead)
- Assume we now have 3 live processors \(P=3\) e.g. \({1, 4, 6}\)
- Look for the remaining tasks: \(8 - 5 = 3\) then use the same tree as a load balancing tree to find these 3 undone tasks
- iterate top to bottom:
- \(4 - 2 = 2\) tasks remain on left → send \(2/3\) of \(P\) this way
- \(4 - 3 = 1\) task remains on right → send \(1/3\) of \(P\) this way
- Repeat until finding the tasks