The single-node beta is released

A little later than I meant, and after a great deal of work, the first public beta of single-node RowKeyDB is out. This is the place to say plainly what it is, what it is not, and what it does.

What it is

RowKeyDB is a key-value store that speaks the Google Cloud Bigtable protocol and runs on your own machines — on any cloud, or on bare metal. So far as I know, it is the only Bigtable-compatible store you can run outside Google Cloud, and an application written for Cloud Bigtable talks to it without a line changed.

I do not make that claim lightly. I have spent twenty-five years building distributed systems, more than seven of them as a site reliability engineer at Google, and I am one of the principal authors of Google’s open-source Bigtable emulator — of which RowKeyDB is the production-grade evolution.

The need it answers is an old one. Close to five thousand companies still run Apache HBase, the open-source Bigtable. It works, but it is heavy to operate: it wants HDFS beneath it and ZooKeeper beside it, it runs on a Java virtual machine whose garbage collector stops the world at the worst moments, and its crash recovery replays a write-ahead log across a distributed filesystem and takes minutes. RowKeyDB keeps the data model and the API and sets all of that aside. There is no HDFS and no ZooKeeper; each node keeps its data on its own SSD and recovers in seconds; it is written in C++, so nothing pauses to collect garbage. For an HBase shop, Google’s own HBase-to-Bigtable adapter should bridge the gap with no change to application code.

It runs on one node, but it is not a toy

It is built to production standards, with the plain caveat that this first release is a beta. The distinction to be careful about is between availability and durability, because they are not the same thing and the single-node case depends on it. A single cloud virtual machine has the availability of a single machine: it can be rebooted or retired, and that is around 99.5%, which is enough for the many workloads that do not need more. Durability is another matter entirely. RowKeyDB is built to run on modern replicated SSDs — AWS io2, Google pd-ssd, replicated NVMe-over-Fabrics — which copy every byte across several fault domains and are rated at five nines of durability, roughly one loss in a hundred thousand disk-years. RowKeyDB’s durability is the disk’s. In its durable mode, a write it has acknowledged has already been flushed to the log, and it will survive an unannounced kill of the process. That guarantee is the first design constraint rather than an afterthought, and it is checked by tests that kill the server in the middle of a write and confirm on restart that nothing acknowledged was lost.

The numbers

I measured it on data made to resemble what it will hold, rather than data chosen to flatter it: rows of about a kilobyte shaped like OpenTelemetry signals — a few attributes of high cardinality, such as the address of a client, and many of low cardinality, such as the outcome of a call — which compress about 2.8 to one at Zstandard level 6, as such telemetry does. The whole dataset was twice the server’s memory, so that most reads reach the disk rather than being answered from cache, and the machine had separate disks for the write-ahead log and the data, each provisioned for 40,000 IOPS, which is an ordinary production arrangement.

On that machine it sustains 30,000 reads per second and 10,000 durable writes per second, both at a 99th-percentile latency under ten milliseconds, while refusing nothing. Driven harder on reads, it saturates the 40,000-IOPS disk at 40,000 reads per second, where the latency rises to seventeen milliseconds and it begins to refuse about three percent of requests rather than accept more than the disk can serve. That refusal is deliberate — it is the work of Limen, the small load-shedding library built into the server — and it is why the database stays up under more load than it can finish. The ceiling at that point is the disk’s, not the database’s: an attempt to push the reads to fifty thousand a second ended not in RowKeyDB faltering but in the benchmarking tool, which ran out of memory and was killed while the database went on serving.

	per second	p99 latency	refused
Reads	30,000	5 ms	none
Reads (disk saturated)	40,000	17 ms	~3%
Durable writes	10,000	9 ms	none

The write rate is the smaller and the separate figure because it is set by how quickly the log can be flushed to the disk, not by how many operations the disk can perform. Raising it on a single machine is a matter of a faster or a parallel write-ahead log, which is among the work ahead.

What comes next

The large piece of work now is the multi-node version, which will scale storage and performance roughly in proportion to the machines given to it, and which is meant to need no separate coordination service to run. Alongside it I will keep fixing and tuning the single-node version toward a general release.

Getting it, and trusting it

The beta, for x86-64 Linux, is on the release page, with the benchmark configuration set out beside it. Every release is signed, and you should confirm that a download came from me before you run it. The signing key’s fingerprint is:

A56D 0CCF 7F9E 66BD F5E9  7CAF 54BD 1DFD F41B D841

You can fetch the key by address and verify a download against it in three commands, set out on the security page. The source is private during the beta.

Try it against your own workload, and tell me what breaks.