Distributed Systems
Consensus Algorithms
Multiple independent processes need to agree on an outcome.
A correct consensus protocol fulfills the following criteria:
- Termination
- every correct process decides some value
- Integrity
- if all processes proposed the same value (v), then any correct process must decide (v)
- Agreement
- Every correct process must agree on the same value
Faults and Failures
Types of Faults
- Transient
- Occurs only once and then goes away
- Intermittent
- Irregular, comes and goes
- Permanent
- Occurs only once and then goes away. Continues to exist (until repair)
Propagation
- Fault
- the cause of an error
- Error
- part of a system’s state that may lead to a failure
- Failure
- a system does not operate according to its specifications
- Faults lead to errors, errors lead to failures.
Types of Failures
- Crash failure
- A server halts, but is working correctly until it halts
- Omission failure
- A server fails to respond to incoming requests, fails to receive incoming messages or fails to send messages
- Timing failure
- A server’s response lies outside the specified time interval
- Response failure
- The server’s response is incorrect, value of the response is wrong, or server deviates from the correct flow of control
- Arbitrary (Byzantine) Failure
- A server may produce arbitrary responses at arbitrary times
(k)-Fault-Tolerance
Fail-Silent
[ n=k+1 ] Replica produces either correct responses or no response at all.
Logical Time
- Time is a canonical way of establishing ordering between events
- In distributed systems, time is not a global variable because machines have independently running clocks
Lamport Timestamps
Propose we have two events, (A) and (B). Lamport timestamps allow for a “happens before” relationship, (A \to B.)
Each node (p) maintains a counter (LT(p)), which gets incremented when (p) performs an action. When (p) sends a message, it includes (LT(p)). A recieving node (q) then updates its own counter to (\max{LT(p),~LT(q)})