Overload protection for C++ servers.
Every server has a limit to how much work it can do at one time. The limit is seldom known in advance, and it does not hold still. It falls when the work grows heavier, when a service the server depends on slows down, when the machine is occupied with something else. A server that takes no account of this limit, and accepts whatever arrives, will under enough load accept more than it can finish. The unfinished requests accumulate. Each new request waits behind the ones already piled up, and waits longer than the one before it. The server spends its effort tending the pile instead of answering, and at the worst moment, when it is needed most, it answers no one — not even the clients whose requests it had ample capacity to serve.
The remedy is for the server to refuse the work it cannot finish, and to refuse it early, before the request has been read and before any real effort has gone into it, so that the price of refusal is small and the capacity preserved is real. This is the whole of the matter, and it is harder than it first appears. The server must know its own limit, which changes from one second to the next. And it must apply that limit on every request it receives, cheaply enough that the measuring does not itself become a burden.
Limen answers a single question, and answers it once for each request: may this request be admitted now? The answer is drawn from the server’s own recent behavior, and in particular from how long the server has lately been taking to do its work.
The heart of the library is an algorithm that Netflix designed and named Gradient2. Once in each interval — about once a second — it compares the time the server is taking now against a longer-running average of the same measurement. When the recent figure rises above the average, the server is slowing down, which is the first sign that it has taken on too much, and the algorithm lowers the number of requests it will allow at once. When the recent figure stays in line with the average, the algorithm raises that number, a little at a time, to find out whether more capacity is there to be used. Any request that arrives while the server is already at its allowed number is refused at once, before it has cost anything to speak of.
A server seldom does only one kind of work, and the kinds are seldom of equal importance. When writes surge, a server may still wish to keep answering reads; when a large batch job arrives, it should not be allowed to take the whole of the server’s capacity and leave none for ordinary traffic. Limen can divide its single limit into named shares, one for each kind of request, and hold each kind to its share. While the server has spare capacity the shares are only advisory, and a kind that is not using all of its own share leaves the remainder to the others; once the server is full, each share becomes a firm limit, and the kinds that matter keep the capacity reserved for them.
Limen also carries a second, separate guard, for work that has been accepted but must wait its turn in a queue. This guard is the Controlled Delay method, CoDel, defined in the internet standard RFC 8289. It watches how long items sit in the queue, and when the shortest time any item has spent waiting stays above a chosen target for long enough, it begins to drop items rather than let them grow stale. The two guards address two different failures. Gradient2 keeps the server from accepting more concurrent work than it can carry. CoDel keeps work that has already been accepted from waiting so long that answering it is no longer worth the trouble. A server may use either guard alone, or both together, and a server that uses both is protected at both points.
When Gradient2 finds that more capacity is available, it raises its limit by a fixed step. Netflix’s library raises it by four, whatever the limit happens to be. Limen raises it instead by roughly the square root of the current limit — by ten when the limit is a hundred, and by a hundred when the limit is ten thousand.
The reason concerns the moment that matters most, which is the moment a server has just been throttled and is recovering. It should recover quickly, because for as long as it holds its limit low it refuses clients for no good reason. Netflix’s step of four adds the same amount whatever the limit, so the larger the server, the smaller that step is in proportion to its limit: four is four parts in a hundred when the limit is a hundred, but only four parts in ten thousand when the limit is ten thousand. The square root keeps far closer pace with the server — a tenth of the limit at a hundred, a hundredth at ten thousand — so a large server recovers much faster under Limen than under the fixed step.
Yet the square root grows more slowly than the limit itself — it is sublinear — and that is the reason for choosing it. A step that grew in direct proportion to the limit would recover fastest of all, and would also be reckless, for a single such step can carry the limit past the server’s true capacity and set it oscillating above and below. The square root takes most of the faster recovery and leaves the recklessness behind. Netflix’s fixed step remains available to anyone who prefers it.
Limen runs inside the server’s own process. It is not a separate service and adds no network hop; it reads its own counters and is consulted on whatever code path the application chooses. It depends on no particular RPC library, and asks only two things of the one it is wired into: that the library allow a request to be refused before its body is read, and that it allow a small handle to be carried from that early point through to the place where the request finishes. These two facilities are common, and gRPC, Thrift, and Cap’n Proto all provide them.
Refusing a request is deliberately cheap. When the limiter says no, the cost is a few of the lightest operations the processor offers, and then a return; no memory is set aside and no reading or writing is done. Admitting a request costs the same, with the single reading of a clock added to it.
Every signal Limen records is published through OpenTelemetry, the open standard for collecting such measurements, and is gathered on the collector’s own schedule rather than on the path of any request. An operator can therefore watch what Limen is doing — the limit it has settled on, the requests it has refused, the items CoDel has dropped — without that watching slowing the server down.
The adaptive limiter at the center of Limen is a port of Netflix’s concurrency-limits, which Netflix released in 2018 under the Apache 2.0 license. Limen carries the same license, and every file taken from the original keeps Netflix’s copyright notice and credits the source. The CoDel filter is not a port. It was written afresh, from the standard, and is our own work.
Limen is one piece of RowKeyDB, a database that speaks the Cloud Bigtable protocol and runs on any cloud or bare metal. We built Limen to keep the RowKeyDB server steady under load, and we held it to the standard we hold the database itself to. It was designed before it was written. About ninety-five percent of its lines are exercised by its tests, among them traffic simulations that drive it as real load would; every test is then run a second time under tools that watch for faults in memory and in the use of threads. Everything it does can be measured from the outside.
When we set out to build Limen, we looked for a C++ library that already did this — adaptive concurrency limiting and active queue management, both in the server’s own process and usable by any application — and we found none. So far as we know, it is still the only library of its kind. The problem it solves belongs to every server, not only to ours, and so we have made it open source.