Goal: Create a fault tolerant algorithm that ensures that all participants in a transaction agree to commit (make their actions permanent). If agreement is not reached, then all group members must agree to abort (undo any changes).
A transaction is a set of operations that operates on, and often modifies, data. Transactons have been typically associated with databases and the data they operate on: tables, rows within a table, and fields within a row. While they are used extensively in database systems, the principles of transactions and commit protocols can apply to any software systems.
A key facet of a transaction is that it keeps data consistent even in case of system failures. Transactions are atomic — all results must be made permanent (commit) and appear to anyone outside the transaction as an indivisible action. If a transaction cannot complete, it must abort, reverting the state of the system to that before it ran. If several transactions run concurrently, the end result must be the same as if those transactions ran in some (any) serial order; transactions cannot interfere with each other’s data. The specific order in which transactions execute is known as a schedule.
A transaction-based model guarantees a set of properties known as ACID:
- The transaction happens as a single indivisible action. The entire transaction succeeds (and the transaction commits) or else the entire transaction is rolled back (the transaction aborts). It is not acceptable for a transaction to complete only partially.
- A transaction cannot leave the data in an inconsistent state. If the system has invariants, they must hold after the transaction. For example, the total amount of money in all accounts must be the same before and after a “transfer funds” transaction. This also affects replicated data. If distributed copies of replicated data must be identical then they must remain identical after a transaction.
- Isolated (Serializable)
- Transactions cannot interfere with each other. If transactions run at the same time, the final result must be the same as if they executed in some serial order. One transaction cannot access intermediate results of another.
- Once a transaction commits, the results are made permanent. No system or software failures after a commit will cause the results to revert. Conversely, until the transaction commits, its results should be revertible. If the transaction must abort, it must return any modified data to its state before the transaction.
A write-ahead log (or transaction log) is used to enable rollback: reverting to the previous state when aborting a transaction. It is also crucial in maintaining the state of the transaction. The system will be able to read the log and recover to where it was in the transaction commit protocol. Write-ahead logs must be in stable storage (e.g., disk or flash memory) so the data and transaction state could be recovered if the system dies and recovers.
In a distributed transaction environment, multiple processes participate in a transaction, each executing its own sub-transaction that can commit only if there is unanimous consensus by all participants to do so. Each system runs a transaction manager, a process that is responsible for participating in the commit algorithm algorithm to decide whether to commit or abort its sub-transaction. One of these transaction managers may be elected as the coordinator and initiates and runs the commit algorithm. Alternatively, the coordinator could be a separate process from any of the transaction participants.
Two-phase commit protocol
The two-phase commit protocol uses atomic multicasts to reach a consensus among the group on whether to commit or abort. It uses a coordinator to send a request (“can you commit?”) to every member of the group (reliably, retransmitting as often as needed until all replies are received). Phase 1 is complete when every member of the group (each participant) responds. If the coordinator gets even a single abort response from a participant, it must tell all participants to abort the entire transaction. Otherwise, it will tell all participants to commit it. In phase 2, the coordinator sends the commit or abort order and waits for a response from everyone. In summary, in phase 1, the coordinator gets everyone’s agreement and in phase 2, the coordinator sends the directive to commit or abort.
The write-ahead log in stable storage is crucial for ensuring atomic multicasts (the write-ahead log is also important for transaction rollback, which is used for aborts). For example, if a participant sent the coordinator a commit response for phase 1 and then died, it must be able to reboot and reconstruct the transaction state from the log; it cannot change its mind after rebooting. The two-phase commit protocol cannot proceed until each participant acknowledges each message.
The two phase commit stalls if any member – the coordinator or any participant – dies. It has to wait for the recovery of that member before proceeding with the protocol. A recovery coordinator can step in in certain circumstances. If the coordinator died and a recovery coordinator took over, it queries the participants. If at least one participant has received a commit message then the new coordinator knows that the vote to commit must have been unanimous and it can tell the others to commit. If no participants received a commit message then the new coordinator can restart the protocol. However, if one of the participants died along with the coordinator, confusion may arise. If all the live participants state that they have not received a commit message, the coordinator does not know whether there was a consensus and the dead participant may have been the only one to receive the commit message (which it will process when it recovers). As such, the coordinator cannot tell the other participants to make any progress; it must wait for the dead participant to come back.
Three-phase commit protocol
The two-phase commit protocol is a blocking protocol that relies on a fail-restart failure model. If the coordinator or any participant crashes, the entire protocol stalls until the failed process is restarted. The three-phase commit protocol is similar to the two-phase commit protocol but allows entities to time out in certain cases to avoid indefinite waits. the role of the coordinator.
The protocol also enables the use of a recovery coordinator by introducing an extra phase where all participants are informed of the decision to commit before any participant can commit. This way, any participant could inform a standby coordinator whether there was a unanimous decision to commit and some commits may have taken place. The three-phase commit protocol propagates the knowledge of the outcome of the election to all participants before starting the commit phase.
In phase 1 of the three-phase commit protocol, the coordinator sends a query to commit request (“can you commit?”) to every member of the group (reliably, retransmitting as often as needed until all replies are received over a period of time). If any of the participants respond with a no or if any participants failed to respond within a defined time, then the coordinator sends an abort to every participant.
In phase 2, the coordinator sends a prepare-to-commit message to all participants and gets acknowledgements from everyone. When this message is received, a participant knows that the unanimous decision was to commit. If a participant fails to receive this message in time, then it aborts. At this point, the participants do not commit. However, if a participant receives an abort message then it can immediately abort the transaction. Each of these messages must be acknowledged by all participants before the coordinator proceeds to phase 3.
In phase 3, the coordinator sends a commit message to all participants telling them to commit. If a participant fails to receive this message, it commits anyway since it knows from phase 2 that there was a unanimous decision to commit. If a coordinator crashes during this protocol, another one can step in and query the participants for the commit decision. If every participant received the prepare-to-commit message then the coordinator can issue the commit directives. If only some participants received the message, the coordinator now knows that the unanimous decision was to commit and can re-issue the prepare-to-commit request followed by a commit. If no participant received the message, the coordinator can restart to protocol or, if necessary, restart the transaction.
The three-phase commit protocol accomplishes two things:
Enables use of a recovery coordinator. If a coordinator died, a recovery coordinator can query a participant.
If the participant is found to be in phase 2, that means that every participant has completed phase 1 and voted on the outcome. The completion of phase 1 is guaranteed. It is possible that some participants may have received commit requests (phase 3). The recovery coordinator can safely resume at phase 2.
If the participant was in phase 1, that means NO participant has started commits or aborts. The protocol can start at the beginning
If the participant was in phase 3, the coordinator can continue in phase 3 – and make sure everyone gets the commit/abort request
Every phase can now time out – there is no indefinite wait as in the two-phase commit protocol.
- Phase 1:
- Participant aborts if it doesn’t hear from a coordinator in time
- Coordinator sends aborts to all if it doesn’t hear from any participant
- Phase 2:
- If coordinator times out waiting for a participant – assume it crashed, tell everyone to abort
- If participant times out waiting for a coordinator, elect a new coordinator
- Phase 3:
- If a participant fails to hear from a coordinator, it can contact any other participant for results
The three-phase commit protocol suffers from two problems. First, a partitioned network may cause a subset of participants to elect a new coordinator and vote on a different transaction outcome. Secondly, it does not handle fail-recover well. If a coordinator that died recovers, it may read its write-ahead log and resume the protocol at what is now an obsolete state, possibly issuing conflicting directives to what already took place. The protocol does not work well with fail-recover systems.
Brewer’s CAP Theorem
Ideally, we would like to have three properties in a replicated data distributed system:
- The data retrieved from the system should be the same regardless of which server is contacted.
- The system should always be available to handle requests.
- Partition tolerance
- The system should continue to function even if some network links do not work and one group of computers cannot talk to another. For example, a link between two data centers may go down.
Eric Brewer’s CAP Theorem states that you can have either consistency or availability in the presence of partitions but not both. Quite often, the theorem is summarized as: if you want consistency, availability, and partition tolerance, you have to settle for at most two out of three of these. However, we expect real-world distributed systems to be at risk of network partitions, so the choice is really between consistency and availability.
With distributed architectures, we generally strive for high availability and the ability to withstand partitions (occasional breaks in the ability of nodes to communicate). Hence, we will have to give up on consistency and break the guarantees of ACID. An alternative to the requirements of ACID is BASE.
BASE stands for Basic Availability, Soft-state, Eventual consistency. Instead of requiring consistency after every transaction, it is enough for the data to eventually get updated to a consistent state. The downside is that some processes may access stale data which has not yet been brought into a consistent state.