CR

Chain Replication Database

In banking system demands are very tight. Database should be at least tripled, stand-by nodes should pick up master reads from failover node, writes should be accepted on a reasonble quorum, failover must be followed by recovery, database should be able to scale even with the RAM/DISC limitations.

No data should be treated as written otherwise that commited to all replicas. All this circumstances leads us to chain replication protocol as a simple and natural feedback to this challenge.

Different replication techniques exists to satisfy replication demands. Master-slave replication is most widely known type of replication used before in such products like GFS, HDFS, mongodb, etc. Quorum Intersection is another technique used in databases like Cassandra or Amazon Dynamo. They mostly provide a consistent distributed repository for event tables or for file storage. In banking industry we synchronize account balances and need simple and managable protocol for storage consistency issuing high demand on system integrity.

There are several classes of error usually implied when dealing with failure detection. The most weak class is fail-stop events, when the outage is normal or predictable. The second class is crash-failures, the ubnormal terminations and outages. The most strong type of failures are byzantine failures resistant to bit-flips, hacked parties or any types of compromising the transaction objects. For banking applications the byzantine fault tolerance is desired, despite it affects the latency. However we will show that CR latency is acceptible even in compare with web applications.

Features

CP database
2N+1 nodes tolerates N failures
Consistent hashing DHT
RAFT for managing server configurations timeline
HMAC signing for Byzantine capabilities
Various database backends: mnesia, riak, redis, fs, sql
High-performance non-blocking TCP acceptor
Separate endpoints for HEART, CLIENT and SERVER protocols
Pure, clean and understandable codebase

Consistent Hash Ring

Bulding a consistent hash ring is a key feature that opens a door to the distributed system. CR is using only five functions to model the DHT ring. Ring provides a desirable probability in series of nines of working event condition.


 > cr:ring().

{12,
 [{0,0},
  {121791803110908576516973736059690251637994378581,1},
  {243583606221817153033947472119380503275988757162,1},
  {365375409332725729550921208179070754913983135743,1},
  {487167212443634306067894944238761006551977514324,1},
  {608959015554542882584868680298451258189971892905,2},
  {730750818665451459101842416358141509827966271486,2},
  {852542621776360035618816152417831761465960650067,2},
  {974334424887268612135789888477522013103955028648,2},
  {1096126227998177188652763624537212264741949407229,3},
  {1217918031109085765169737360596902516379943785810,3},
  {1339709834219994341686711096656592768017938164391,3},
  {1461501637330902918203684832716283019655932542972,3}]}

The ring or configuration is partitioned by shards or peers.


> cr:peers().

[{'[email protected]',9000,9001,9002},
 {'[email protected]',9004,9005,9006},
 {'[email protected]',9008,9009,9010}]

Each peer is running several replica protocol vnodes. Each vnode is a replica process that serves a specific key-range.


> cr:local().

[{487167212443634306067894944238761006551977514324,<0.200.0>},
 {365375409332725729550921208179070754913983135743,<0.199.0>},
 {243583606221817153033947472119380503275988757162,<0.198.0>},
 {121791803110908576516973736059690251637994378581,<0.197.0>}]


> cr:chain(foo).

[{1461501637330902918203684832716283019655932542972,3},
 {487167212443634306067894944238761006551977514324,1},
 {974334424887268612135789888477522013103955028648,2}]

Chain Replication Protocol

Command

Command is an atomic event that can be performed in single process context at a single machine.

CR provides extensible set of possible commands:

PUT the object to database
LINK the object to some doubly-linked list
REMOVE the object from the list and database

This set of commands refers to KVS the database framework for storing the doubly linked lists (it can be called chains/feeds/sequences) using the two basic record types: #container, who store the top of a chain along with chain aggregation counters; and #iterator, who provides next and prev fields for traversal.

Distributed Transaction

All replicas are sequenced into the chains. Transaction is a command performing forward over the ordered chain of replicas. This chain is called configuration. All writes come to the chain's head, all reads come to chain's tail.

Replication Log

During transaction, the command is saved in replication log on each replica of the transaction. This log is append-only disk structure and is also called this history of replica's operations.

The replication log is also uses KVS as underlying storage. As a replication log container it uses #log type and command is stored as #operation record. Each replica has its own log.

Replica Protocol

Some assumptions are implied during protocol description.

1) each peer has at least one non-faulty vnode;

2) ring is tracked by external consensus or

3) ring has at least one peer with no faulty vnodes.

#operation [Vnode,Chain,Operation] — Any active replica Vnode in configuration Chain can issue an operation command only if each preceding replica in Chain, if any, has done likewise and there is no conflicting operation for s in its history. Vnode also adds a new order proof to its history.

#suspend [Vnode] — An active replica Vnode can suspend updating its history by becoming immutable at any time. Only heart monitor can issue a becomeImmutable message. The replica signs a wedged statement to notify heart monitor that it is immutable and what its history is.

#resume [Vnode,Configuration,History] — A pending replica Vnode in Configuration can resume handling operations if the Heart Monitor has synchronized the history between nodes to the greatest common prefix log.

Failures

Configuration Tracking

The configuration is a dynamic property of transaction. During transaction it may change due to byzantine failures, leading us to reconfigure the replicas in a chain. The another consistent system is needed to track the dynamic configurations.

To make the shard highly available, we use replication and dynamically change the configuration of replicas in order to deal with crash failures and unresponsiveness. Each machine in a cluster has single append-only configuration log which is not based on KVS due to latency requirements. Configuration log is a binary file written by RAFT protocol commands. There is only two commands which could be performed over the configuration log:

ADD replica to configuration
DELETE replica from configuration

Heart Monitor Protocol

#reconfig [Node,Configuration,NewConfiguration] — The heart monitor waits for a set of valid histories from a quorum of replicas in current configuration. A valid history contains at most one record per operation. The oracle then issues an #resume message for all nodes in NewConfiguration with the log position of maximal common prefix (last replica in previous Configuration). The heart monitor can issue at most one #resume message per Configuration generation.

#ping — Round-Robin ping over nodes of Configuration. In initial configuration all nodes are active or resumed.

Safety

Stable Operation Log

The equation specifies what operations O are safe, when all its replicas are commited. but not when or in what order to do them. In other words, the system is asynchronous. In this formula we call stable operation log having operations commited on all replicas.


Stable = [ R || R <- replicas(O),
                status(R) == commited,
                length(R) == N ]

NOTE: due to asynchronous nature of transaction service the operations log will be always unordered. As on Picture 3 it should GCP = 2.

Liveness

There is always eventually a configuration in which all replicas are correct and do not become suspended. Failure detection of liveness is tracked by Heart Monitor which pings each node and reconfigures the nodes for synchronizing the configuration consensus log.

OTP protocol

Some types are embedded in L core to resolve main tasks during type inference, type unification and patterm maching compilation. L has following basic types which are used by infer/unify/match core. These types are also shared with Type Inspector.

INTERCONNECT

transaction
get
sync

PING

ping
join
leave

Implementation

The chain replication protolcol is implementes as Erlang/OTP application cr that could be embeded in any toplevel application. We use one supervision tree and gen_server per one TCP endpoint along with separate vnode_sup supervision for VNODE transactional contexts per hashring vnode.

The Chain Replication Database application is built using Synrc Application Stack. Among them we have fs native file-system listener, sh shell executor for running external commands, powerful mad rebar replacement which is able to pack application inside single-file bundle. During development we also use otp.mk and active file reloader that uses native filesystem event on each platform. The database itself built using kvs with mnesia backend and db banking schema as example.


> application:which_applications().

[{cr,"Chain Replication","0.1"},
 {sh,"VXZ SH Executor","0.9"},
 {mad,"MAD VXZ Build Tool","2.2"},
 {db,"Bank Database","1"},
 {active,"ACT VXZ Continuous Compilation","0.9"},
 {kvs,"KVS Abstract Term Database","1"},
 {mnesia,"MNESIA  CXC 138 12","4.12.3"},
 {fs,"VXZ FS Listener","0.9.1"},
 {stdlib,"ERTS  CXC 138 10","2.2"},
 {kernel,"ERTS  CXC 138 10","3.0.3"}]

Supervision tree of chain replication supervisor:


> cr:sup().

[{vnode_sup,<0.52.0>},
 {client_sup,<0.51.0>},
 {client,<0.50.0>},
 {ping_sup,<0.289.0>},
 {ping,<0.48.0>},
 {interconnect_sup,<0.47.0>},
 {interconnect,<0.46.0>}]

For benchmarking database please populate the it with data but without overloading the database:


    [
      begin
          cr:test(500),
          timer:sleep(1000)
      end
          || ___ >- lists:seq(1,10)
    ].

> cr:dump().
                                               vnode   i  n        top      log        latency
    121791803110908576516973736059690251637994378581   1  1       6506     1607       1/315/97
    243583606221817153033947472119380503275988757162   2  1       6508     1662      1/317/100
    365375409332725729550921208179070754913983135743   3  1       6510     1658      2/317/105
    487167212443634306067894944238761006551977514324   4  1       6505     1583      1/317/104
    608959015554542882584868680298451258189971892905   5  2       6499     1637      3/317/115
    730750818665451459101842416358141509827966271486   6  2       6510     1664      2/318/117
    852542621776360035618816152417831761465960650067   7  2       6501     1634      2/311/115
    974334424887268612135789888477522013103955028648   8  2       6500     1575       3/290/96
   1096126227998177188652763624537212264741949407229   9  3       6497     1607      3/316/118
   1217918031109085765169737360596902516379943785810  10  3       6510     1662      3/318/117
   1339709834219994341686711096656592768017938164391  11  3       6496     1658      3/311/106
   1461501637330902918203684832716283019655932542972  12  3       6505     1583      2/295/104

Literature

[1]. Hussam Abu-Libdeh, Robbert van Renesse, Ymir Vigfusson.
Leveraging Sharding in the Design of Scalable Replication Protocols

[2]. Robbert van Renesse, Chi Ho, Nicolas Schiper.
Byzantine Chain Replication

[3]. Robbert van Renesse, Nicolas Schiper.
Chain Replication for Supporting High Throughput and Availability

Credits

Maxim Sokhatsky
Vladimir Kirillov
Sergey Klimenko
Valery Meleshkin
Victor Sovietov