Due to how HashiCorp's Vault and Consul work, it isn't easy to combine them into a satisfying, georedundant setup.

Before getting to the problem and the solution, I'll explain what consul and consul-replicate do - and how - in the context of vault.

The solution to georedundancy in this context is simple and has already been applied to other software in the very same way. This article is specifically about that solution in the context of vault, consul and consul-replicate.

Definitions

Vault - shared credential storage

Consul - Zookeeper. In Go. From Hashicorp.

Consul-Replicate - k/v replication daemon for consul

Gossip - Used ambiguously for the gossip protocol itself and for the communication between LAN and WAN nodes (LAN gossip, WAN gossip). Realized through Serf.

Consul

Consul is a lot like Zookeeper - it provides service discovery, basic health checks and a key/value storage. Multi-DC setups are possible and supported, though for the k/v storage this means relaying requests to other dcs instead of cross-dc replication.
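
For illustration, a k/v read can be relayed to another dc through the dc parameter of consul's HTTP API - a minimal sketch, assuming a local agent on 127.0.0.1:8500 and a remote datacenter named dc-b:

# read a key from the local dc
curl -s http://127.0.0.1:8500/v1/kv/foo/bar

# the same read, relayed by the local servers to the servers in dc-b
curl -s "http://127.0.0.1:8500/v1/kv/foo/bar?dc=dc-b"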

(Figure: Consul DC-spanning request)

As seen in the figure above, a Consul cluster over multiple DCs consists of at least five parts: the server and client agents of the LAN cluster in each dc, plus the WAN gossip pool connecting the servers across dcs.

LAN clusters share information through Serf's gossip protocol, which includes the key/value storage. WAN clusters, on the other hand, share only the service definitions and health checks.

The k/v storage is special in this case since consul doesn't keep a transaction log, and master-master replication without one is not possible [1]. Cross-DC connections have too much latency for simple replication.

As a compromise, UDP could be used, but this would result in essentially no data integrity, which is not acceptable for use with a credential storage.

Consul-Replicate

Consul-Replicate is HashiCorp's solution for cross-dc replication of Consul's k/v storage.

The premise is simple - consul-replicate retrieves all k/v pairs under a set prefix from a defined dc and applies them exactly to the local [2] k/v storage, from where the information is transferred to the other nodes through gossip.
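
As a minimal sketch (the prefix and datacenter name are placeholders, and the same options can also go into a config file, which is what the rest of this article uses), replicating everything under vault/ from a dc called dc-b into the local k/v storage looks like this:

# pull all k/v pairs under vault/ from dc-b and apply them locally;
# keys that no longer exist in dc-b are deleted from the local prefix
consul-replicate -prefix "vault@dc-b"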

That means the replication is one-way: values are pulled from the source dc and written as-is into the local dc, with no notion of which side changed last - and k/v pairs under the prefix that don't exist in the source dc are deleted locally.

Which brings us to the problem.

The problem

Within the same dc the latency is low enough to distribute k/v pairs in a master-master fashion [3]. Over WAN gossip the latency would most certainly cause lost information.

The same applies to k/v pairs replicated with consul-replicate.

Assume dcs A and B, both replicating to the same prefix, with an unobstructed network in both directions.

The following will happen:

  1. replicator in dc A retrieves k/v pairs from dc B
  2. replicator in dc B retrieves k/v pairs from dc A
  3. user writes k/v pair to the replicated prefix in dc B
  4. replicator in dc A gets the k/v pairs from dc B with the just written k/v pair
  5. replicator in dc B gets the k/v pairs from dc A without the just written k/v pair
  6. replicator in dc A writes the k/v pair that is missing locally
  7. replicator in dc B deletes the k/v pair that didn't exist in dc A at the time of the requests

And this goes back and forth over and over until the delay of one of the requests happens to be just high enough for the key either to persist or to vanish.

This is the most common problem with master-master replication and is usually solved with a transaction log and attached timestamps. Hence, in master-master replication it is not values but changes that are actually replicated.

The solution

The solution is simple - don't use master-master replication. Neither consul, nor consul-replicate, nor vault is suited for that task.

My solution to making consul and vault georedundant is to create one consul-replicate configuration per target dc, to use consul's locks (so that there is only one replication daemon per dc) and to wrap a service around consul-replicate.
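
To make that concrete, here is a sketch of the layout the wrapper script below assumes - the path and dc names are placeholders, and each file configures consul-replicate to pull from the dc it is named after (see the consul-replicate documentation for the exact config syntax):

# one consul-replicate configuration per possible leader dc
$ ls /path/to/consul-replicate-config/
dc-a.json  dc-b.json  dc-c.json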

The service is just a shell script that continuously checks for the vault leader, looks up which dc that leader is in and checks whether it is a different dc than the local one.

#!/usr/bin/env sh
# MIT License Copyright(c) Nelo-T. Wallus <nelo@wallus.de>

# These are my common functions, available a.o. on
# https://github.com/ntnn/scripts
. common

# this can be a templated list of all hosts, though a better way is to
# generate the list on the fly from a cmdb or monitoring system
vault_nodes="list of all vault nodes"

finish() {
    # stop a running replication daemon, if any, then exit
    test -n "$(jobs -p)" && kill -HUP $(jobs -p)
    exit 0
}
trap finish EXIT INT TERM HUP

leader_address() {
    args_in $# 1 1 || return 1
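    # /v1/sys/leader returns JSON along the lines of
    #   {"ha_enabled":true,"is_self":false,"leader_address":"https://host:8200"}
    # (exact fields vary with the vault version); the sed below extracts the host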
    curl -ksL "https://$1/v1/sys/leader" \
        | sed -r 's#^.*leader_address":"https://(.*):[0-9]{1,5}".*$#\1#'
}

leader() {
    # all nodes have to be checked until a leader has been found,
    # since only unsealed nodes return the current leader
    for host in $vault_nodes; do
        leader=$(leader_address $host)
        test -n "$leader" && echo $leader && return
    done
}

dc() {
    # this is environment specific and hence removed
    # there are multiple options to do this:
    #   1. a cmdb
    #   2. from hostname, if it contains the dc
    #   3. querying a service running on the remote host that answers
    #      with the dc name (insecure unless e.g. tls client certs, a
    #      shared secret or dns forward/reverse lookup auth is used)
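    # a minimal sketch of option 2, assuming hostnames embed the dc,
    # e.g. vault01.dc-a.example.com (hypothetical naming scheme):
    #   echo "$1" | cut -d. -f2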
}

replicate() {
    args_in $# 1 1 || errno=2 die no target dc specified

    # start replication
    # the locks/ prefix isn't replicated, so it is local to the dc;
    # this ensures that only one replication daemon per dc is
    # replicating the k/v pairs
    consul lock \
        locks/consul-replicate \
        consul-replicate \
        -config "/path/to/consul-replicate-config/$1.json" \
        &

    # keep checking whether the leader is still in the target dc
    while test "$1" = "$(dc "$(leader)")"; do
        sleep 1
    done

    # kill consul-replicate, the lock will be released
    kill -HUP $(jobs -p)
}

while true; do
    leader=$(leader)
    leader_dc=$(dc $leader)

    if test -z "$leader_dc"; then
        log info No leader found, waiting for unsealed vaults
        sleep 5
    elif test "$(dc $(hostname -f))" != "$leader_dc"; then
        log info "Leader '$leader' found, trying to obtain lock"
        replicate $leader_dc
    else
        log info Leader is in this dc, sleeping
        sleep 1
    fi
done

My own script includes a few more sanity-checks, but these are environment-specific (also, they don’t contribute towards explaining the solution).

This script is handled by the service daemon. In theory it would be possible to make consul-replicate its own service and let this script act as a supervisor service, but that would make it possible to force-start the consul-replicate service without checking the leader dc. Hence it is not a safe and shippable solution.

The kill signal for the service is simply SIGHUP. The trap stops the replication-daemon gracefully and then exits.
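
For reference, without a service manager the same can be done by hand - path and name of the script are placeholders:

# start the wrapper in the background and remember its pid
/usr/local/bin/vault-replicate.sh &
replicator_pid=$!

# stopping it: the trap shuts down consul-replicate and releases the lock
kill -HUP "$replicator_pid"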

Cross-DC Vault

As a side note, since this has regularly spawned questions on the mailing list and in tickets: the replication has to be set up before initializing vault. Otherwise each dc will have a different key and its own leader, which - if the above approach is used - causes all replication daemons to start replicating from the first dc they check (since each dc will have a leader), resulting in the original problem. Just messier.

Once you have the replication set up, initializing one vault node should bring the required k/v pairs to the other dcs, whose vault nodes will then report as initialized and can simply be unsealed.
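
As a rough sketch of that flow (hostnames are placeholders; on newer vault versions the commands are vault operator init and vault operator unseal):

# in dc A: initialize exactly once, note the unseal keys and root token
VAULT_ADDR=https://vault-a1.example.com:8200 vault init

# in dc B: the replicated k/v pairs make the node report as initialized ...
VAULT_ADDR=https://vault-b1.example.com:8200 vault status

# ... so it only needs to be unsealed, once per required unseal key
VAULT_ADDR=https://vault-b1.example.com:8200 vault unseal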

I want to note here that it is easier to use a patched consul-replicate which allows you to define a prefix and to exclude the prefixes you don't want to replicate, namely vault/sys/expire - though I'm not sure whether non-leaders actually attempt to execute expirations.

Conclusion

In the end I don't think this is a satisfying setup. It works, it is somewhat reliable, and it doesn't lose secrets unless your leader is constantly switching dcs. However, I'll probably switch to a backend that supports cross-dc replication natively.


  1. Without risking the loss of information. [return]
  2. While it would be possible to push the changes to remote locations, it is generally a better idea not to add even more delay to the replication. [return]
  3. Though this isn't feasible for something like secrets. [return]