Due to how HashiCorp's Vault and Consul work, it isn't easy to run them in a satisfying, georedundant setup.

Before getting to the problem and the solution I'll explain what Consul and consul-replicate do in the context of Vault, and how.

The solution to georedundancy in this context is simple and has already been applied to other software in the very same way. This article is specifically aimed at the solution in the context of Vault, Consul and consul-replicate.


Vault - shared credential storage

Consul - ZooKeeper. In Go. From HashiCorp.

Consul-Replicate - k/v replication daemon for consul

Gossip - Used ambiguously both for the gossip protocol itself and for the communication between LAN and WAN nodes (LAN gossip, WAN gossip). Realized through Serf.


Consul is a lot like ZooKeeper - it provides service discovery, basic health checks and a key/value storage. Multi-DC setups are possible and supported, though for the k/v storage this means relaying requests to other DCs instead of cross-DC replication.
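For example, Consul's k/v HTTP endpoint takes a `dc` query parameter that tells the local agent to forward the read to another datacenter. A minimal sketch; the address, key and DC name are made-up example values:

```shell
# Reading a key from another DC relays the request to that DC's servers
# instead of hitting a local replica. The curl call is commented out
# since it needs a running agent; all values here are examples.
consul_addr="localhost:8500"
url="http://${consul_addr}/v1/kv/some/key?dc=dc2"
# curl -s "$url"
echo "$url"
```

Every such read pays the full cross-DC round trip, which is exactly why it is relaying rather than replication.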


As seen in the figure above, a Consul cluster spanning multiple DCs consists of at least five parts: the server agents in each DC, the client agents in each DC, and the WAN gossip pool connecting the servers across DCs.

LAN clusters share information through Serf's gossip protocol, which includes the key/value storage. WAN clusters, on the other hand, share only the service definitions and health checks.

The k/v storage is special in this case, since Consul doesn't keep a transaction log. Master-master replication without one is not possible[1], and cross-DC connections have too high a delay for simple replication.

As a compromise UDP could be used, but this would result in essentially no data integrity, which is not acceptable for a credential storage.


Consul-Replicate is HashiCorp's solution for cross-DC replication of Consul's k/v storage.

The premise is simple - consul-replicate retrieves all k/v pairs under a configured prefix from a defined DC and applies these exactly to the local[2] k/v storage, from where the information is transferred to the other nodes through gossip.
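Such a run is driven by a small configuration with a prefix stanza naming the source DC, roughly like the following. This is a sketch - the values are examples and the exact schema depends on your consul-replicate version, so check its README:

```json
{
  "consul": "127.0.0.1:8500",
  "prefix": {
    "source": "vault@dcB",
    "destination": "vault"
  }
}
```

One such file per source DC is what the script below selects from with `$1.json`.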

That means the replicated prefix always mirrors the source DC exactly - any k/v pair that exists locally but not remotely is deleted on the next run.

Which brings us to the problem.

The problem

Within the same DC the latency is low enough to distribute k/v pairs in a master-master fashion[3]; over WAN gossip it would most certainly cause lost information.

The same applies to k/v pairs replicated with consul-replicate.

Assume DCs A and B, both replicating to the same prefix, with a clear network in both directions.

The following will happen:

  1. replicator in dc A retrieves k/v pairs from dc B
  2. replicator in dc B retrieves k/v pairs from dc A
  3. user writes k/v pair to the replicated prefix in dc B
  4. replicator in dc A gets the k/v pairs from dc B with the just written k/v pair
  5. replicator in dc B gets the k/v pairs from dc A without the just written k/v pair
  6. replicator in dc A writes the k/v pair that is missing locally
  7. replicator in dc B deletes the k/v pair that didn't exist in dc A at the time of the requests

And this happens back and forth, over and over, until the delay of one of the requests happens to be just high enough for the key to either persist or vanish.
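The steps above can be reproduced with a toy model, using two directories as stand-in k/v stores (everything here is made up, not real Consul):

```shell
# Two directories stand in for the k/v stores of dc A and dc B;
# files are keys.
a=$(mktemp -d); b=$(mktemp -d)

# 3. a user writes a k/v pair to the replicated prefix in dc B
echo secret > "$b/key"

# 4./5. the replicators take their snapshots: A already sees the new
#       key in B, but B's snapshot of A predates the write
snap_of_b=$(ls "$b")    # contains "key"
snap_of_a=$(ls "$a")    # empty

# 6. the replicator in dc A writes the key that is missing locally
for k in $snap_of_b; do cp "$b/$k" "$a/$k"; done

# 7. the replicator in dc B deletes every key absent from its snapshot of A
for k in $(ls "$b"); do
    case " $snap_of_a " in
        *" $k "*) ;;        # existed in A, keep it
        *) rm "$b/$k" ;;    # missing in A, delete it
    esac
done

ls "$a"    # prints "key": the pair now exists in A ...
ls "$b"    # prints nothing: ... and was deleted in B
```

After one round the key has moved rather than replicated; the next round reverses it, which is exactly the oscillation described above.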

This is the most common problem with master-master replication and is usually solved by a transaction log with attached timestamps. Hence in master-master replication it is not values but changes that are actually replicated.
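A minimal illustration of why change logs fix this (the log format is made up): both sides record timestamped operations instead of bare values, so a merge can replay them in order and the newest change wins regardless of which replicator saw it first:

```shell
# Hypothetical change logs, one per DC, with lines of the form
# "<timestamp> <operation> <key> <value>"
log_a=$(mktemp); log_b=$(mktemp)
printf '%s\n' '100 set foo 1' '300 del foo -' > "$log_a"
printf '%s\n' '200 set foo 2' > "$log_b"

# merge both logs by timestamp; the last entry decides the key's fate
sort -n "$log_a" "$log_b" | tail -n 1    # → 300 del foo -
```

With plain value snapshots, as consul-replicate uses, that ordering information simply doesn't exist.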

The solution

The solution is simple - don't use master-master replication. Neither Consul, nor consul-replicate, nor Vault is suited for that task.

My solution for making Consul and Vault georedundant is to create one consul-replicate configuration per target DC, use Consul's locks (so that only one replication daemon runs per DC) and wrap a service around consul-replicate.

The service is just a shell script that continuously checks for the Vault leader, looks up which DC that leader is in and checks whether it is another DC.

#!/usr/bin/env sh
# MIT License Copyright(c) Nelo-T. Wallus <nelo@wallus.de>

# These are my common functions, available a.o. on
# https://github.com/ntnn/scripts
. common

# this can be a templated list of all hosts, though a better way is to
# generate the list on the fly from a cmdb or monitoring system
vault_nodes="list of all vault nodes"

finish() {
    kill -HUP $(jobs -p)
    exit 0
}
trap finish EXIT INT TERM HUP

leader_address() {
    args_in $# 1 1 || return 1
    curl -kqL https://$1/v1/sys/leader \
        | sed -r 's#^.*leader_address":"https://(.*):[0-9]{1,5}".*$#\1#'
}

leader() {
    # all nodes have to be checked until a leader has been found,
    # since only unsealed nodes return the current leader
    for host in $vault_nodes; do
        leader=$(leader_address $host)
        test -n "$leader" && echo $leader && return
    done
}

dc() {
    # this is environment specific and hence removed
    # there are multiple options to do this:
    #   1. a cmdb
    #   2. from hostname, if it contains the dc
    #   3. querying a service running on the remote answering with
    #      the dc name (insecure unless e.g. tls client cert,
    #      shared secret or dns forward/reverse lookup auth is done)
    :
}

replicate() {
    args_in $# 1 1 || errno=2 die no target dc specified

    # start replication in the background
    # the locks/ prefix isn't replicated, so it is local to the dc
    # this ensures that only one replication-daemon per dc is
    # replicating the k/v pairs
    consul lock \
        locks/consul-replicate \
        "consul-replicate -config /path/to/consul-replicate-config/$1.json" &

    # keep checking if the leader switched the dc
    while test "$1" = "$(dc "$(leader)")"; do
        sleep 1
    done

    # kill consul-replicate, the lock will be released
    kill -HUP $(jobs -p)
}

while true; do
    leader=$(leader)
    leader_dc=$(dc $leader)

    if test -z "$leader_dc"; then
        log info No leader found, waiting for unsealed vaults
        sleep 5
    elif test "$(dc $(hostname -f))" != "$leader_dc"; then
        log info "Leader '$leader' found, trying to obtain lock"
        replicate $leader_dc
    else
        log info Leader is in this dc, sleeping
        sleep 1
    fi
done
My own script includes a few more sanity-checks, but these are environment-specific (also, they don’t contribute towards explaining the solution).

This script is handled by the service daemon. In theory it would be possible to make consul-replicate its own service and let this script act as a supervisor service, but that would make it possible to start the consul-replicate service without checking the leader DC. Hence it is not a safe, shippable solution.

The kill signal for the service is simply SIGHUP. The trap stops the replication-daemon gracefully and then exits.
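This trap pattern can be demonstrated in isolation; in the sketch below `sleep` stands in for the consul-replicate child process:

```shell
# SIGHUP triggers the trap, which stops the background child and exits
# cleanly; everything runs in a subshell so we can capture its output.
out=$(sh -c '
    finish() {
        kill -HUP $(jobs -p) 2>/dev/null
        echo stopped
        exit 0
    }
    trap finish HUP
    sleep 300 >/dev/null &    # stand-in for consul-replicate
    kill -HUP $$              # what the service manager would send
    wait
')
echo "$out"
```

The same mechanism is what releases the Consul lock: killing consul-replicate ends the lock's child command, so another daemon may acquire it.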

Cross-DC Vault

As a side note, since this has regularly spawned questions on the mailing list and in tickets: the replication has to be set up before initializing Vault. Otherwise each DC will have a different key and its own leader. If the above approach is then used, all replication daemons start replicating from the first DC they check (since each DC has a leader), resulting in the original problem. Just messier.

Once the replication is set up, initializing one Vault node should bring the required k/v pairs to the other DCs; the nodes there will then report that they are initialized and let you unseal them.

I want to note here that it is easier to use a patched consul-replicate which allows defining a prefix and excluding the prefixes you don't want to replicate, namely vault/sys/expire - though I'm not sure whether non-leaders actually attempt to execute expiries.


In the end I don’t think this is a satisfying setup. It works, it is somewhat reliable, it doesn’t cause lost secrets unless your leader is constantly switching dc’s. However I’ll probably switch to a backend which supports cross-dc replication natively.

  1. Without risking losing information. [return]
  2. While it would be possible to push the changes to remote locations, it is generally a better idea not to introduce even more delay into the replication. [return]
  3. Though this isn't feasible for something like secrets. [return]