Johannes Karl Arnold

Distributed Systems


Consensus Algorithms

Multiple independent processes need to agree on an outcome.

A correct consensus protocol fulfills the following criteria:

Termination
every correct process decides some value
Integrity
if all processes proposed the same value (v), then any correct process must decide (v)
Agreement
Every correct process must agree on the same value

Faults and Failures

Types of Faults

Transient
Occurs only once and then goes away
Intermittent
Irregular, comes and goes
Permanent
Occurs only once and then goes away. Continues to exist (until repair)

Propagation

Fault
the cause of an error
Error
part of a system’s state that may lead to a failure
Failure
a system does not operate according to its specifications

Types of Failures

Crash failure
A server halts, but is working correctly until it halts
Omission failure
A server fails to respond to incoming requests, fails to receive incoming messages or fails to send messages
Timing failure
A server’s response lies outside the specified time interval
Response failure
The server’s response is incorrect, value of the response is wrong, or server deviates from the correct flow of control
Arbitrary (Byzantine) Failure
A server may produce arbitrary responses at arbitrary times

(k)-Fault-Tolerance

Fail-Silent

[ n=k+1 ] Replica produces either correct responses or no response at all.

Logical Time

  • Time is a canonical way of establishing ordering between events
  • In distributed systems, time is not a global variable because machines have independently running clocks

Lamport Timestamps

Propose we have two events, (A) and (B). Lamport timestamps allow for a “happens before” relationship, (A \to B.)

Each node (p) maintains a counter (LT(p)), which gets incremented when (p) performs an action. When (p) sends a message, it includes (LT(p)). A recieving node (q) then updates its own counter to (\max{LT(p),~LT(q)})