Chain Replication Database
In banking system demands are very tight. Database should be at least tripled, stand-by nodes should pick up master reads from failover node, writes should be accepted on a reasonble quorum, failover must be followed by recovery, database should be able to scale even with the RAM/DISC limitations.
No data should be treated as written otherwise that commited to all replicas. All this circumstances leads us to chain replication protocol as a simple and natural feedback to this challenge.
Different replication techniques exists to satisfy replication demands. Master-slave replication is most widely known type of replication used before in such products like GFS, HDFS, mongodb, etc. Quorum Intersection is another technique used in databases like Cassandra or Amazon Dynamo. They mostly provide a consistent distributed repository for event tables or for file storage. In banking industry we synchronize account balances and need simple and managable protocol for storage consistency issuing high demand on system integrity.
There are several classes of error usually implied when dealing with failure detection. The most weak class is fail-stop events, when the outage is normal or predictable. The second class is crash-failures, the ubnormal terminations and outages. The most strong type of failures are byzantine failures resistant to bit-flips, hacked parties or any types of compromising the transaction objects. For banking applications the byzantine fault tolerance is desired, despite it affects the latency. However we will show that CR latency is acceptible even in compare with web applications.
Features
- CP database
- 2N+1 nodes tolerates N failures
- Consistent hashing DHT
- RAFT for managing server configurations timeline
- HMAC signing for Byzantine capabilities
- Various database backends: mnesia, riak, redis, fs, sql
- High-performance non-blocking TCP acceptor
- Separate endpoints for HEART, CLIENT and SERVER protocols
- Pure, clean and understandable codebase
Consistent Hash Ring
Bulding a consistent hash ring is a key feature that opens a door to the distributed system. CR is using only five functions to model the DHT ring. Ring provides a desirable probability in series of nines of working event condition.
The ring or configuration is partitioned by shards or peers.
Each peer is running several replica protocol vnodes. Each vnode is a replica process that serves a specific key-range.
Chain Replication Protocol
Command
Command is an atomic event that can be performed in single process context at a single machine.
CR provides extensible set of possible commands:
- PUT the object to database
- LINK the object to some doubly-linked list
- REMOVE the object from the list and database
This set of commands refers to KVS the database framework for storing the doubly linked lists (it can be called chains/feeds/sequences) using the two basic record types: #container, who store the top of a chain along with chain aggregation counters; and #iterator, who provides next and prev fields for traversal.
Distributed Transaction
All replicas are sequenced into the chains. Transaction is a command performing forward over the ordered chain of replicas. This chain is called configuration. All writes come to the chain's head, all reads come to chain's tail.
Replication Log
During transaction, the command is saved in replication log on each replica of the transaction. This log is append-only disk structure and is also called this history of replica's operations.
The replication log is also uses KVS as underlying storage. As a replication log container it uses #log type and command is stored as #operation record. Each replica has its own log.
Replica Protocol
Some assumptions are implied during protocol description.
- 1) each peer has at least one non-faulty vnode;
- 2) ring is tracked by external consensus or
- 3) ring has at least one peer with no faulty vnodes.
#operation [Vnode,Chain,Operation] — Any active replica Vnode in configuration Chain can issue an operation command only if each preceding replica in Chain, if any, has done likewise and there is no conflicting operation for s in its history. Vnode also adds a new order proof to its history.
#suspend [Vnode] — An active replica Vnode can suspend updating its history by becoming immutable at any time. Only heart monitor can issue a becomeImmutable message. The replica signs a wedged statement to notify heart monitor that it is immutable and what its history is.
#resume [Vnode,Configuration,History] — A pending replica Vnode in Configuration can resume handling operations if the Heart Monitor has synchronized the history between nodes to the greatest common prefix log.
Failures
Configuration Tracking
The configuration is a dynamic property of transaction. During transaction it may change due to byzantine failures, leading us to reconfigure the replicas in a chain. The another consistent system is needed to track the dynamic configurations.
To make the shard highly available, we use replication and dynamically change the configuration of replicas in order to deal with crash failures and unresponsiveness. Each machine in a cluster has single append-only configuration log which is not based on KVS due to latency requirements. Configuration log is a binary file written by RAFT protocol commands. There is only two commands which could be performed over the configuration log:
- ADD replica to configuration
- DELETE replica from configuration
Heart Monitor Protocol
#reconfig [Node,Configuration,NewConfiguration] — The heart monitor waits for a set of valid histories from a quorum of replicas in current configuration. A valid history contains at most one record per operation. The oracle then issues an #resume message for all nodes in NewConfiguration with the log position of maximal common prefix (last replica in previous Configuration). The heart monitor can issue at most one #resume message per Configuration generation.
#ping — Round-Robin ping over nodes of Configuration. In initial configuration all nodes are active or resumed.
Safety
Stable Operation Log
The equation specifies what operations O are safe, when all its replicas are commited. but not when or in what order to do them. In other words, the system is asynchronous. In this formula we call stable operation log having operations commited on all replicas.
NOTE: due to asynchronous nature of transaction service the operations log will be always unordered. As on Picture 3 it should GCP = 2.
Liveness
There is always eventually a configuration in which all replicas are correct and do not become suspended. Failure detection of liveness is tracked by Heart Monitor which pings each node and reconfigures the nodes for synchronizing the configuration consensus log.
OTP protocol
Some types are embedded in L core to resolve main tasks during type inference, type unification and patterm maching compilation. L has following basic types which are used by infer/unify/match core. These types are also shared with Type Inspector.
INTERCONNECT
- transaction
- get
- sync
PING
- ping
- join
- leave
Implementation
The chain replication protolcol is implementes as Erlang/OTP application cr that could be embeded in any toplevel application. We use one supervision tree and gen_server per one TCP endpoint along with separate vnode_sup supervision for VNODE transactional contexts per hashring vnode.
The Chain Replication Database application is built using Synrc Application Stack. Among them we have fs native file-system listener, sh shell executor for running external commands, powerful mad rebar replacement which is able to pack application inside single-file bundle. During development we also use otp.mk and active file reloader that uses native filesystem event on each platform. The database itself built using kvs with mnesia backend and db banking schema as example.
Supervision tree of chain replication supervisor:
For benchmarking database please populate the it with data but without overloading the database:
Literature
[1]. Hussam Abu-Libdeh, Robbert van Renesse, Ymir Vigfusson.
Leveraging Sharding in the Design of Scalable Replication Protocols
[2]. Robbert van Renesse, Chi Ho, Nicolas Schiper.
Byzantine Chain Replication
[3]. Robbert van Renesse, Nicolas Schiper.
Chain Replication for Supporting High Throughput and Availability
Credits
- Maxim Sokhatsky
- Vladimir Kirillov
- Sergey Klimenko
- Valery Meleshkin
- Victor Sovietov