Before getting to the problem and the solution, I’ll explain what Consul and consul-replicate do in the context of Vault.
The solution to georedundancy in this context is simple and has already been applied to other software in the very same way. This article is specifically aimed at the solution in the context of Vault, Consul and consul-replicate.
Vault - shared credential storage
Consul - Zookeeper. In Go. From Hashicorp.
Consul-Replicate - k/v replication daemon for consul
Consul is a lot like Zookeeper - it provides service-discovery, basic health-checks and a key/value storage. Multi-DC setups are possible and supported, though for the k/v storage this means relaying requests to other dcs instead of cross-dc replication.
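To make the relaying concrete: a k/v read can be directed at another dc by passing the `dc` query parameter to Consul’s HTTP API. The addresses and dc names here are made up:

```sh
# read a key from the local dc
curl http://localhost:8500/v1/kv/foo

# the same read relayed to dc2 - the local servers forward the
# request over the WAN instead of answering from a local replica
curl http://localhost:8500/v1/kv/foo?dc=dc2
```

If the WAN link is down, the second request fails - there is no local copy of dc2’s data to fall back on.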
As seen in the figure above a Consul cluster over multiple DCs consists of at least five parts:
- two Consul nodes
- two LAN clusters
- one WAN cluster
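Wiring the two LAN clusters into the WAN pool is a matter of agent configuration. A minimal sketch for one dc’s servers (dc names and hostnames assumed):

```json
{
  "datacenter": "dc1",
  "server": true,
  "retry_join_wan": ["consul-dc2.example.com"]
}
```

The servers of the other dc get the mirrored configuration, and the WAN gossip pool forms between the server nodes of both dcs.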
The k/v storage is special in this case since Consul doesn’t keep a transaction log. Master-master replication without one is not possible[1]. Cross-DC connections have a delay too high for simple replication.
As a compromise UDP could be used, but this would result in essentially no data integrity, which is not acceptable for use with a credential storage.
Consul-Replicate is Hashicorp’s solution for realising cross-dc replication of Consul’s k/v storage.
The premise is simple - consul-replicate retrieves all k/v pairs under a configured prefix from a defined dc and applies them exactly to the local[2] k/v storage, from where the information is transferred to the other nodes through gossip.
That means the following applies:
- if a key exists in the remote dc, it is created locally
- if a key does not exist in the remote dc, it is deleted locally
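In configuration terms that means pointing consul-replicate at a source prefix in a remote dc; a minimal fragment (prefix and dc names assumed):

```hcl
prefix {
  # mirror everything under "global" from dc2 into the local
  # "global" prefix - creations and deletions included
  source      = "global@dc2"
  destination = "global"
}
```

Note that the destination is made an exact copy of the source snapshot - there is no merging involved.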
Which brings in the problem.
Within the same dc the latency is low enough to distribute k/v pairs in a master-master fashion[3]. WAN gossip would most certainly cause lost information.
The same applies to k/v pairs replicated with consul-replicate.
Assume dc A and B, both replicate to the same prefix and the network in both directions is clear.
The following will happen:
- replicator in dc A retrieves k/v pairs from dc B
- replicator in dc B retrieves k/v pairs from dc A
- user writes k/v pair to the replicated prefix in dc B
- replicator in dc A gets the k/v pairs from dc B with the just written k/v pair
- replicator in dc B gets the k/v pairs from dc A without the just written k/v pair
- replicator in dc A writes the k/v pair that is missing locally
- replicator in dc B deletes the k/v pair that didn’t exist in dc A at the time of the request
And this ping-pongs back and forth until the delay of one of the requests happens to be just high enough for the key to either persist or vanish.
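The race is easy to reproduce without any Consul at all. The following toy model uses two directories as the k/v stores of the two dcs and replays the steps above, one snapshot-and-apply pass per replicator:

```shell
#!/usr/bin/env sh
# Toy model: each dc's k/v store is a directory; one replication pass
# makes the local store match the snapshot taken from the remote dc,
# creating missing keys and deleting keys absent from the snapshot.

mkdir -p dc_a dc_b

# user writes a k/v pair to the replicated prefix in dc B
echo "s3cr3t" > dc_b/token

snap_a=$(ls dc_b)   # replicator in dc A snapshots dc B: sees "token"
snap_b=$(ls dc_a)   # replicator in dc B snapshots dc A: sees nothing

# replicator in dc A applies its snapshot: "token" is created locally
for k in $snap_a; do cp "dc_b/$k" "dc_a/$k"; done

# replicator in dc B applies its snapshot: "token" is not in it -> delete
for f in dc_b/*; do
    [ -e "$f" ] || continue
    k=$(basename "$f")
    printf '%s\n' "$snap_b" | grep -qx "$k" || rm "$f"
done

ls dc_a   # -> token  (copied over from dc B)
ls dc_b   # -> empty  (the user's write is gone)
```

Run the two passes again with fresh snapshots and the key hops back to the other side - exactly the ping-pong described above.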
This is the most common problem with master-master replication and usually solved by a transaction log with attached timestamps. Hence in master-master replication not values but changes are actually being replicated.
The solution is simple - don’t use master-master replication. Neither consul, nor consul-replicate, nor vault are suited for that task.
My solution to making consul and vault georedundant is to create one consul-replicate configuration per target dc, use Consul’s locks (so that only one replication daemon runs per dc) and wrap a service around consul-replicate.
The service is just a shell script continuously checking for the Vault leader, looking up which dc this leader is in and checking whether that is another dc. (The script itself is MIT-licensed, Copyright (c) Nelo-T. Wallus.)
My own script includes a few more sanity-checks, but these are environment-specific (also, they don’t contribute towards explaining the solution).
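Stripped of those environment-specific checks, the wrapper boils down to the following sketch. `leader_datacenter` stands in for whatever leader lookup fits your environment, and the per-dc config path is an assumption:

```sh
#!/usr/bin/env sh
# MIT License Copyright (c) Nelo-T. Wallus

# the dc this agent lives in, taken from the local Consul agent
local_dc=$(curl -s http://localhost:8500/v1/agent/self \
           | sed -n 's/.*"Datacenter" *: *"\([^"]*\)".*/\1/p')

replicate_pid=""

stop_replication() {
    [ -n "$replicate_pid" ] && kill "$replicate_pid" 2>/dev/null
    replicate_pid=""
}

# SIGHUP stops the replication daemon gracefully, then exits
trap 'stop_replication; exit 0' HUP

while :; do
    leader_dc=$(leader_datacenter)   # environment-specific lookup
    if [ "$leader_dc" != "$local_dc" ] && [ -z "$replicate_pid" ]; then
        # wrap in a consul lock so only one daemon runs per dc
        consul lock "locks/replicate" \
            consul-replicate -config "/etc/consul-replicate/$leader_dc.hcl" &
        replicate_pid=$!
    elif [ "$leader_dc" = "$local_dc" ]; then
        # never replicate into the dc that holds the vault leader
        stop_replication
    fi
    sleep 5
done
```

`consul lock` holds a lock in the k/v store for the lifetime of its child process, which is what keeps the replication daemon singular per dc.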
This script is handled by the service daemon. In theory it’d be possible to make consul-replicate its own service and let this script act as a supervisor service, but that would make it possible to start the consul-replicate service without checking the leader dc. Hence not a safe and shippable solution.
The kill signal for the service is simply SIGHUP. The trap stops the replication-daemon gracefully and then exits.
As a side note, since this has regularly spawned questions on the mailing list and in tickets: the replication has to be set up before initializing vault. Otherwise each dc will have different keys and its own leader, which - if the above approach is used - causes all replication daemons to start replicating from the first dc they check (since each dc will have a leader), resulting in the original problem. Just messier.
Once the replication is set up, initializing one vault node should bring the required k/v pairs to the other dcs, which will then report as initialized and let you unseal them.
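With replication in place the bootstrap then looks roughly like this (pre-0.9 Vault command names; the number of unseal keys depends on your init options):

```sh
# in one dc, exactly once:
vault init        # prints the unseal keys and the initial root token

# then on each vault node, in every dc:
vault unseal      # prompts for an unseal key; repeat until the
                  # key threshold is reached
```

The nodes in the other dcs never get initialized themselves - they pick up the replicated state and only need unsealing.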
I want to note here that it is easier to use a patched consul-replicate that allows defining a prefix and excluding the prefixes you don’t want to replicate, namely vault/sys/expire - though I’m not so sure whether non-leaders actually attempt to execute expiries.
In the end I don’t think this is a satisfying setup. It works, it is somewhat reliable, and it doesn’t cause lost secrets unless your leader is constantly switching dcs. However, I’ll probably switch to a backend that supports cross-dc replication natively.