Scaling smoothly: RevenueCat's data-caching techniques for 1.2 billion daily API requests

At RevenueCat we serve more than 1.2 billion requests per day. You can only effectively do that if:

You distribute the load among many web servers.
You use cache to speed up access to hot data and protect the capacity of the backend systems and data stores.

The cache system is composed of several servers with good amounts of ram and high network capacity. They store the data in memory or flash for fast retrieval. They are usually key-value, and the most widely used is memcached. To keep them fast and simple, cache servers typically are “share-nothing”, a simple key-value store with no dependencies on other systems. They belong to a logical “pool”, but it is the client who chooses which server to talk to for storing or retrieving the data. Clients often use a hashing mechanism to shard the different keys among cache servers, to distribute the load and data evenly across the cache servers.

The cache is critical, and needs to achieve three things:

Low latency: It needs to be fast. If a cache server has issues, you can’t retry. If a server is unresponsive, you can’t try to open a new connection. If you do so, the web servers will get stuck processing requests while new ones keep piling up at a rate of tens of thousands per second.
Up and warm: It needs to be up and hold the majority of the hot data. If you lose it, it would surely bring down the backend systems with too much load.
Consistency: It also needs to be consistent, i.e. never hold stale or incorrect data.

To achieve all these you need to be really careful operating and designing your system. We are going to cover here some of the things we do to achieve a fast, reliable and consistent cache at RevenueCat. A lot of the techniques mentioned in this article are supported by our Open Source meta-memcache cache client. Our client supports the new meta-commend memcache protocol that empowers some very advanced features. We had to develop our own client since none was supporting the new meta protocol and we needed those advanced features.

1. Low latency

Pre-established connections

Opening a TCP connection is slow for cache operations. The TCP handshake adds 2-3 additional packets and one additional round trip to the cache server, plus TCP does retries and waits if things aren’t working as expected.

Cache servers are often limited by their network throughput, so reducing the number of packets is very important for their capacity.
The cache responses come from memory, they are very fast (in the order of ~100us) so the network round trip (~500-700us for the same AZ) will be the dominant factor in response times. Adding another roundtrip to establish a connection almost doubles the cache response time.

Our cache client manages a pool of open connections. You can configure how many to establish on startup and the maximum number of connections to keep in the pool.

The connections are borrowed by the application for executing a request, and returned to the pool when the request is over.

During high usage, when connections are mostly out of the pool, more connections might be opened. And if when returned the pool is full they are torn down.

You will need to find the best setting to balance the number of connections you will make to the cache servers and the usage of cache during peaks, to avoid having to create new connections often.

Fail fast

In an ideal world, your server always has spare cpu cycles, an uncongested network and the cache server equally has the resources to handle your request. In practice, while you monitor capacity and scale up as needed, it is not always the case.

This means that sometimes a cache server will become unresponsive. It is often small glitches, a network blip, a brief spike. Retrying the cache operation will often work but that is extremely risky for cache. Sometimes it does not recover quickly, and that is the time when that innocent retry can bring down your entire serving infrastructure.

Consider the following scenario:

Say that you have 1000 requests per second on a server. You serve most, say 95% out of cache and only 5% use the DB. The requests that hit only the cache take 10ms, and the ones that use the db 50ms, so your average response time is 12ms. On average the server will have 12 concurrent requests in-flight.

If a server becomes unresponsive, you might consider retrying the request… You have two options:

Retry immediately:
If you retry the operations immediately, you will be doubling the request rate to that problematic cache server. If the server was failing due to load, you will totally ensure it keeps dead for good with no recovery in sight. I’m not even mentioning if you try to retry more than once :). Also it is most likely that if there was some transient problem, you will hit the same problem if you just retry immediately. Retries usually need to wait a bit before being attempted.
Retry after a small sleep:
Let’s say you wait a reasonably-low 100ms. Chances are that some requests hit the same cache server more than once, but let’s assume that only once. Let’s also assume that we are (very) lucky and only 25% of the requests are affected and have to retrieve data from that cache server. The latency will be increased by 100ms*25/100 = 25 ms on average. This means that our original latency has now tripled to 37ms. This also means that you will need 3x the server capacity. You might run a bit overprovisioned to handle spikes, but most certainly you can’t sustain 3x the traffic suddenly. You might point that just doing a sleep doesn’t really use cpu, but the number of concurrent requests to the server will triple, more memory will be needed to handle them, the cpu will be busier multitasking over those connections… Even the database will get longer connections when requests start to slow down, Not to mention the increase in load if the retry is also unsuccessful. If the DB gets slower, that will slow requests too, and the death spiral goes on making things worse and worse… A single 100ms sleep for a retry, under heavy load, can bring your entire server fleet to its knees.

So… what should you do if a cache server becomes unresponsive and fails a request? Fail fast, assume a miss and continue, never retry.

Even more, you should “mark it down” for some period and not attempt new requests, not attempt new connections. TCP is built for reliability so there are timeouts, retries and waits built in the connection establishment that can act as a hidden “100ms sleep”.

In summary, how do we achieve low-latency?:

Low timeouts: Configure low connection and receive timeouts, so you will quickly assume a server has issues instead of getting stuck for long waiting for a response. Cache latencies are really stable, with low P99s, thanks to servers using only RAM, so you can be aggressive and use as low numbers as 100ms. But it doesn’t matter, as we have seen, those are still a long time to wait for cache, so you will need more to deal with failures.
Fail fast & Mark down: On failure, clients should mark the server down for a few seconds. If there are still healthy connections on the pool they can be used, but no new connections shall be attempted while marked down, and requests should immediately fail if there are no healthy connections.
Take failures as a cache miss, the application will fallback to the source of the data.

If you are wondering what about cache writes?, you are very right! We will cover that under the third section, about keeping the cache “consistent”.

2. Up and warm

Servers can’t stay up all the time. You need to assume they will fail and account for it.

Also you need to keep your cache pool warm, with most of the hot data. You need to monitor the hit rate, ensure there is enough capacity for all the hot data. Restarting servers means losing the data (as it is ephemerally stored in memory), so this imposes a lot of restrictions on how you operate your cache server fleet.

Lets see what things we do at RevenueCat to keep cache up and warm.

Plan for failure

Servers will fail, so, how can you minimize the impact when one fails? You can add a lot of cache servers. The more cache servers you have, the less impact they cause if one goes away. But having too many can bring costs up and be wasteful. You can try using many small servers:

That definitely helps reduce the impact of a lost server.

But small ones suffer more from hot keys. When a key is disproportionately requested, the server that gets that key gets much hotter than the others. When the servers are big and there aren’t so many, the hot keys do not represent a huge deviation in the overall load. But with lots of small servers, you will see bigger deviations in load that might cause saturation and issues:

Deciding on a good number and sizing will depend a lot on your needs in terms of capacity, access patterns, traffic, etc.

In general you should understand your backend capacity and design the cache tier such that you can tolerate at least 2 cache servers being down. If your backend is sharded, make sure that the sharding for cache and for the backend are orthogonal, so a cache server being down translates into a moderated increase on ALL backend servers, instead of high load just on one.

Fallback pools

Cache servers handle a lot of traffic, so sometimes scaling the backend for supporting 2 cache server failures is a lot of overscaling. There is the option of creating a “fallback” cache cluster. If a server fails, the client will retry on this fallback pool.

There are two types of fallback pools:

Mirrored pool: both pools receive all the writes, they are in sync and warm, so you can just fallback reads to the mirror pool when needed.

The mirrored pools should have a different sharding “salt”, so when a server dies in one of the pools, its keyspace is distributed differently in the second pool. All the servers in the other pool will get part of the key-space. Not changing the salt would double the load on the matching box, overloading it and it might cause cascading issues.

The main drawback of this approach is cost, keeping a lot of data in memory is expensive, so keeping it twice even more. But this model is optimal when paired with running web servers in multiple availability zones (you colocate each pool with the servers in the same AZ). This makes the system very reliable, with each AZ behaving as an independent pod. Most reads are local, in the same AZ, so they are faster and also reduce cross-AZ transfer costs, offsetting some of the costs of the duplicated data.

Gutter pool: it is a small, empty pool, used as a short term store when a cache server fails on the primary pool. You configure it to hold values for a very low TTL, for example just 10 seconds. The hottest data will be cached for that duration instead of hitting the backend on every request, reducing the amount of over-provisioned capacity needed in the backend.

A gutter pool, with low TTLs, won’t be as effective at reducing backend load as a mirrored pool (that is fully warmed up and uses long TTLs). But it doesn’t need to be kept consistent doing double writes, so it is simpler to operate, can be much smaller and it is very cost-effective.

Our meta-memcache client supports gutter pools out of the box, and allows you to implement mirrored pools writing your custom Router.

Dedicated pools

Memcache is very simple. All the data belongs to the same keyspace. Their memory is split in chunks, called slabs, and each slab is dedicated to data of some fixed size (this is done to avoid fragmentation). When the memory available to memcache is exhausted it will need to free memory to accommodate the new incoming data. It will try to respect the Least Recently Used (LRU) policy, and invalidate only keys not used, but it is not magic. Sometimes it needs to change a chunk from one capacity to another, clearing the full chunk. An influx of new data can evict data from cache before the desired TTL is reached. And even sometimes data that is very expensive to get is not frequently access, so is pushed out by other data more heavily requested.

In short, there is not a lot of control over what data is kept in cache. There is no way to prioritize or anything advanced, and sometimes you need to keep certain important datasets warm, especially if they are expensive to recompute or they are counters that can lead to inaccuracies if evicted often.

The best way to do this is by creating dedicated pools of memcache for certain usages, so you will ensure certain capacity is always dedicated for important use-cases. You should always monitor the hit-rate of each of your use-cases and use that information to expand cache pools or create dedicated ones.

The only drawback is that the more cache pools, the more connections each web server needs to keep open to each of the servers on each of the pools, and connections take memory and other resources. At some point you might need to look into using memcache proxies to reduce the number of connections opened.

Hot keys

We mentioned how some keys can be disproportionately hot. The most typical example is when you have some configuration that you fetch in every request, some rate limiter, or the auth info of a really big customer that is fetched on every one of their requests…

In extreme cases, a single key can be requested too much for a single memcache server to handle it.

There are several techniques used in the industry:

Key splitting: Consist in having many versions of the same key, “sharding” the key. For example keyX will become keyX/1, keyX/2, keyX/3, and so on, and each will be placed in a different server. Clients will read from one (usually deterministically from their client id) but write to all to keep it consistent. The hard part of this system is how to detect the hot keys and build the pipeline to make all clients know what keys to split, how much to split, and coordinate that they all do it at the same time to avoid inconsistencies. Plus you will need to do this quickly because sometimes the hot keys are triggered by real life events or trend. The list of hotkeys is not static!. If you want to implement this in our cache client let us know and we can discuss some strategies here.
Local cache: A simpler mechanism is to detect hot keys and cache them locally on the client. You can only do this for data that changes rarely, since the local cache won’t provide proper consistency, but often with low TTLs and choosing well what keys are allowed to be cached locally you can find an acceptable tradeoff. Remains the problem on how to detect what is a hot key. For that, the new memcache’s meta-command protocol comes to the rescue. It supports returning the time a key was last accessed and you can implement a probabilistic hot cache. If you see that the key’s last access time is less than X seconds ago many times, it means it is hot. This is fully supported by our meta-memcache client and we use it successfully to reduce 25% of the cache workload by caching just ~100 really hot items on each instance. The local cache needs no coordination across all the servers, unlike key splitting.

Avoiding “thundering herds”

If a key is very hot and expires or is deleted, all web servers will get a miss and try to get the value from the backend at the same time, causing big load spikes, increased latencies and saturation that can cascade back to the web serving tier. This is called a “thundering herd”.

At RevenueCat we usually maintain cache consistency by updating it during writes, which helps reduce thundering herds. But there are also other models for caching:

Low TTLs: You use a relatively low TTL to refresh cache periodically. This can be helpful for non-user data like configuration for example.
Invalidations: You stream changes from the DB, for example, and invalidate the values in the cache.

These two models can cause a lot of problems with thundering herds if the keys that expire or are invalidated are really hot.

To avoid this you need to achieve that only a single client queries the backend and populates again the cache, not all of them at once. For this you can use two mechanisms implemented in out meta-memcache library:

Recache policy: “gets” include a recache policy indicating a recache TTL. When remaining ttl is < the given value, one of the clients will “win”, will get a miss and will re-populate the value in cache, while the other clients will “lose” and continue to use the existing value.
Stale policy: In the delete command you can opt to mark the key as stale, triggering the same mechanism as above: A single client will get the miss while others keep using the old value.

There is a third case of thundering herd: when there is an eviction of a highly requested key. This should be rare thanks to how memcache tries to evict always the least recently used (LRU) keys, but sometimes it can happen (the server holding it is restarted, etc…) The point is that when a heavily requested key goes missing it can also cause all web servers to hit the backend at once. Our client also has implemented the Lease policy for this situation. Like in above examples only one client will “win” the right to repopulate the value. The problem is that the losers don’t have any stale value to use in the interim, they just have to wait for the winner to re-populate. And we have already talked about the risks of adding waits to high qps code paths. Certainly, it is equally bad hitting too hard on the backend, so this can still be useful. But use it only if you are experiencing the issue and understand the implications.

Resharding

Sometimes the cache cluster runs out of capacity. The hit ratio starts to drop with a newly added use case that is evicting other data from the cache, or just the workload grew bigger.

Usual way to fix this is by adding more servers, but you need to be very careful with the way the clients shard the data to do this. Depending on the sharding mechanism, you might reshuffle all the keyspace, basically making all the cache ice-cold.

To avoid that you need to use a consistent hashing algorithm, that will maintain the places of the majority of the keys, and only change the % of servers you add. In particular we use uhashring (mostly due to legacy reasons, it was used by our old cache library, but it works very well). If you need something else you can contribute to the project!

If you want to learn more about how consistent hashing algorithms work, there are plenty of good articles about them.

One last thing to remember is that while consistent sharding makes adding servers safe, adding and removing them later is not a safe operation, since you will return to original places with stale values. It will introduce inconsistencies in your cached data!

Migrations

Sometimes it is not about adding servers but replacing them. While this can be done one-at-a-time, leaving time for the cache to warm up after every change, it is extremely time consuming and it can lead to mistakes.

We once got a maintenance notice from our cloud provider stating that they needed to restart our cache servers within one week for maintenance, so we opted to build a migration strategy in our cache client.

Migration strategy for our cache client.

It basically executes a client-driven smooth migration process:

Starts warming up the destination pool, populating it first by “mirroring” all the writes. Writes will not be sent to both pools.
Then populate some % of reads. The value read from the origin cluster is sent to the destination cluster, so we warm up this entry.
At some point, when the destination pool is already warm enough, we can switch to read from the new pool while keeping writes sent to both. This keeps both pools consistent and the migration can still be reverted if the destination pool wasn’t warm enough and is causing overload issues.
And then finally just use just the destination pool, at which point you can decommission the origin pool and consider the migration done.

The way this is scheduled is via the configuration the migration client receives: a map of migration mode and start time for the stage of the migration by timestamps, to ensure all servers act coordinated and change behavior at the same time. Note the time across all boxes needs to be synchronized properly to keep time deviations in the ms range to avoid issues.

To make sure the consistency is kept high, reads are only populated with “add” (set only if not present), so there are no races with concurrent writes. It also uses the no-reply mode, it does not expect nor wait for response to avoid adding extra latency on cache reads.

Some non-idempotent actions like those for counters and locks are not replicated as they can’t be kept consistent. Those usually don’t need to be warmed up anyway, they transition to the destination cluster when reads are moved there.

This migration client has helped us replace full clusters with 16 cache servers, with millions of RPM, within just 2-3 hours, keeping hit rates high, with low database impact and without measurable impact on the end users.

3. Consistency

There is the saying “There are only two hard things in computer science: cache invalidation and naming things“, often attributed to Phil Karlton.

There are two hard things in computer science... Cache invalidation and naming things.

While it intends to be funny, if you ever had to work on cache consistency you will know it is indeed hard.

You likely have many web servers handling traffic concurrently. But even if it was just one web server, it can still execute requests concurrently in multiple cpus. This unfortunately means that there will be races, which can introduce cache consistency problems.

One trivial race that can lead to a consistency problem is:

In the above example, the cache starts empty and:

Web server 1 tries to read cache, gets a miss, falls back to the DB, reads “red” and tries to refill “red” in the cache.
Web server 2 happens to do a write, to set the value to “green”, updating both DB and cache.

Depending on the order of the cache writes, the state of the value in the cache can be different. If the value doesn’t match what is the value in the DB, it means there is an inconsistency.

You might say that this is easy to solve, just make the cache refill use “add” instead of “set”. “Add” will fail if the cache is not empty. It will definitely work in the above example, but there are many other corner cases:

The web server 1 cache write might fail, timeout, be lost, or the server might die, never issuing the cache update… Then the refill “add” can succeed, leaving cache stale.
We might have db replicas with lag that introduce races among refills.

Our meta-memcache library allows a lot of control of the low-level meta commands, which helpful for dealing with consistency and high-throughput usages:

compare-and-swap: to detect races while writing values. During the read you get a token and send the token with the write. If the value was modified since the read the token won’t match and the write will fail.
leases: so only one client is granted the right to update the cache. Memcache places a mark on a miss and other clients know that there is another client populating the cache instead of race between them.
recache policies to implement stale-while-revalidate semantics. Once a client will get to update the cache while the others use the stale value.
Mark stale: Instead of deleting a key, you can mark it stale and one cache client will get to update the cache, ensuring the cache is re-validated but without the thundering herds of deletes.
Lowering TTLs: With “touch” you can adjust the TTL of keys (without having to read and write back the value), so you can ensure a key would expire soon.
Write failure tracking: to keep track of write errors.

We are not going to cover them all, but just introduce the most important strategies we use for keeping our cache consistent:

Write failure tracker

A write failure almost always means that there is an inconsistency in the cache. A value that you wanted to write wasn’t written, so the state of the cache is unknown, but likely wrong.

As we discussed above, retrying cache operations while handling the cache can produce all shorts of performance and cascading issues.

Our strategy is to keep the fail-fast approach, but record what keys fail to be written. Our cache client allows you to register a handler so you get the stream of write failures. We collect those, dedup them, and invalidate the cache at least once for each of the reported keys (retrying as many times as needed until we succeed). This ensures that the value will be freshly populated from the cache, and reach consistency.

This simple mechanism allows us to consider cache writes as “always succeeding”, which hugely simplifies the possible scenarios when dealing with cache consistency during CRUD operations, as we will see below.

Consistent CRUD operations among two stores

During a write you need to update both DB and cache to remain consistent. While databases usually provide the concept of transaction, when two different stores are involved you won’t have that luxury and guaranteeing consistency gets complicated.

We have implemented CRUD strategies to access data. They implement highly consistent caching mechanisms and can be reused easily, just configuring the behavior, source of data, etc. We highly recommend building abstractions for CRUD access that abstract away the nuances of updating DB & cache, so product engineers can focus on the business logic, and these strategies are heavily battle-tested and safe to work with.

Let’s see how we implemented the CRUD operations for a highly consistent cache:

READ:

We try to read from cache. On miss, read from DB and refill the value into the cache.

To refill, we just always use “add” to prevent races with concurrent writes.

Races between concurrent refills, even when using replicas with lag, are not a problem. If the read value is stale, it is because the value has just changed, so there will be a cache write shortly to fix it. If that write fails, the write failure tracker will ensure affected keys are invalidated eventually.

There is still the possibility of a cache write working, the key instantly expiring, and a refill from a still stale read replica populating the cache. For these scenarios the options are:

Embed the cache keys in the DB transactions (postgresql allows to write user values in the WAL, in mysql some embed metadata as query comments). Then have a tailer of the WAL/binlogs invalidating the keys after the replication to the read replica has succeeded.
Make all updates trigger a delayed invalidation, delayed more than your replica lag, similar to write failure tracker.

Fortunately the lag of our replicas is <100ms, the chances of the cache writes being evicted in that period are minimal, and we didn’t need to implement any of those.

UPDATE:

You need to update DB and cache with the change. What are your options?

If you write to the db first and then to the cache, the cache write might fail altogether:. Even our write failure tracker might fail. The server can get struck by a lightning bolt after issuing the DB commit!
You can reverse the operations, but now if db write fails, the cache has the new value, while the DB doesn’t, so it is also inconsistent.

The strategy we implemented is doing cache operations before and after:

Before: Reduce cache TTL for the value to something low, like 30s
Write to DB
After: Update cache with the new value (increasing also TTL back to normal)

Let’s now consider the possible failure scenarios:

Fail between 1 and 2: nothing happens, the change doesn’t make it to the DB, so the cache is consistent. The data will still expire due to the “touch” reducing TTL and will be repopulated, but that does not introduce consistency problems.
Fail between 2 and 3: the cache is not updated after the write to the DB. It will be stale for a bit, but thanks to the initial “touch” reducing the TTL, it will expire quickly and the new value will be populated.
We also record the write failures (and reducing TTL to low values is also considered a write failure) so if the cache writes fail themselves, due to any reason, we will proceed to invalidate the affected key as soon as the cache server becomes available, achieving consistency.

In summary, reducing the TTL before the DB operation is a simple yet effective strategy for implementing highly-consistent updates (paired, of course, with the write failure tracker to ensure cache writes can be relied upon).

CREATE:

For “creates” you can develop a similar strategy, using leases with low TTL before the db write. The lease will prevent any stale data to stay in cache if the create operation doesn’t complete.

We did not need to implement this, since our ids come from the DB and that is the one that provides the serialization needed to avoid races. We can just use a simple “add” operation after the DB write, knowing the id is unique.

DELETE:

We also reduce the TTL before the DB operation and issue the delete after. For optimizing latency the last delete doesn’t need to be synchronous and wait for response, since the TTL reduction is awaited and it is guaranteed to be recorded as write failure, and retried if it fails.

Our application use-case doesn’t have races on deletes, so we can just issue a simple delete operation. But deletes leave just nothing in cache, and that can race with refills (especially if you have DB read replicas with non-trivial replication latency). So, depending on your use case, you might need to leave a “deletion mark” as value (interpreted as a miss or negative lookup) with some low TTL to avoid those kinds of races.

Conclusion

We have shared some of the techniques and strategies we use to operate a highly performant and consistent cache infrastructure. A lot of those can be implemented out of the box with our Open Source cache client for python, or can be adapted and used to other languages as well. We hope this can be helpful if you are facing similar challenges or you are just interested in learning the kind of effort needed to keep the RevenueCat platform reliable for our customers.

If you love to build complex, distributed and reliable systems don’t forget to check our open positions!

Scaling smoothly: RevenueCat’s data-caching techniques for 1.2 billion daily API requests