Reliability at RevenueCat
I think we’ll build a more reliable IAP service than anybody else can. Here are the preventions and mitigations that will help us achieve this.
Computers are hard. Miguel and I were never more than a meter away from a laptop for the first four years of RevenueCat for this reason. Using RevenueCat in your app means putting a lot of trust in somebody else’s computers to keep working. Because of this, we’ve always invested a lot into reliability. We’ve even been accused of moving too slowly on product development, and I think this is partly because we spend a lot of time making sure we don’t break the computers.
We think about reliability in two primary dimensions: prevention and mitigation. Prevention is making sure the computers don’t break. Mitigation is making sure when they do break (and they will) the damage is limited. While prevention is ideal, we’ll never reach 100% reliability. RevenueCat exists in the real, distributed world, and as was evidenced last week when Amazon knocked out both our primary databases, there will always be things we don’t expect.
In this post I’d like to talk about what mitigations we have, what happens when our API is inaccessible, steps we’re taking on prevention, and what additional measures we have in the works.
While prevention is never fully achievable, compounding best practices can significantly reduce the number and severity of incidents. There are both macro and micro level investments we’ve made in prevention, and I want to highlight some of them here:
- We have a six-person and growing infrastructure team whose entire raison d’être is to make sure the computers are working. This represents a quarter of our engineering budget fully dedicated to infrastructure and reliability. They handle not just the care and feeding of our machines, but monitoring, deployment pipelines, DevOps, performance, scaling, simplification, everything that is needed to ensure we can safely scale and change the RevenueCat system. This team is led by our Head of Infrastructure, who has a decade of experience scaling Facebook’s systems.
- Automated testing, code reviews, testing and staging environments, canary deployments, easy rollbacks, and more. We’ve invested a ton in the best practices of change management. When we mess up (and we do), it’s easy to catch the mistake before it reaches production, and when one does slip through, we’ve built a lot of tooling that makes undoing that deployment easy. These things come with overhead, but trading some velocity for reliability is an important tradeoff.
- We have a rather mature on-call program and incident response culture. Gone are the days of Miguel carrying a laptop everywhere he went. We have a Slack channel, #binary-solo, where, when the bits hit the fan, we can all gather in an emergency room and work the issue in real time. The team navigates extremely consequential, time-sensitive decisions collaboratively. We’re always in that room when something terrible is happening, but we maintain a cool collective head. We also have a great culture of post-mortem analysis, analyzing our failures and extracting action items that will prevent the same issue from recurring. One thing we can still improve here is making sure we prioritize those action items.
- We maintain a reliability roadmap that includes major and minor investments that we think will be needed as we continue to scale. These can be things like code-level refactors of how we access storage layers (Data Access Layer), planned database migrations (which don’t always go well), migrations of our cache layer (for which there is no link, because it went well), and simpler things like smarter monitoring that lets us understand what’s actually happening with the system. This work has also reduced our average response time by 50% over the last two years.
- With our initial SOC2 audit in 2021, we made significant improvements to not just security, but also general organizational hygiene that I believe lessens the chance of a catastrophic breach or incident.
These are just a few of the highlights, but not a week goes by that we don’t make some change that’s meant to lessen the likelihood of a major failure. As last week demonstrated, there will always be single points of failure as long as we’re relying on a single database and database provider. Moving away from this world is on our roadmap, but it’s not a small task and likely isn’t the most impactful investment we could make right now. We will never reach perfect reliability, which is why every ounce of prevention should be met with an equal measure of mitigation.
Mitigation is the effort we take to ensure the downsides of an inaccessible RevenueCat API are minimized. Because RevenueCat primarily serves mobile devices, our API can be inaccessible for any number of reasons: bad connectivity on mobile devices, RevenueCat incidents, or even intentional blocking. Our SDK is designed with this in mind and has several fail-safes built in. Let’s talk about what happens when the SDK can’t reach our servers:
- The majority of paid users, i.e. users with a subscription, won’t even notice. The SDK locally caches the most recent entitlement response from our backend, and even uses the response time of that request to validate the end of a subscription period, so even if a subscription lapses while the API is inaccessible, the entitlement will not be revoked.
- While the server is inaccessible, the most affected users are those trying to purchase something new. There are two cases here, depending on the state of the SDK cache:
- User has no cached Offerings – If the user’s device was never able to reach our servers, it will have no mapping of Offerings to Products. This will typically manifest as a failure of the paywall to load (depending on your error handling logic) and prevent the user from purchasing at all.
- User has cached Offerings – In the case that a user had previously opened the app with the API reachable, they will have a local cache of Offerings and Products. In this case, they will most likely be able to see a paywall and make a purchase. The SDK will transact with the underlying store, but won’t be able to post that receipt to RevenueCat for verification and entitlement unlocking. This is the worst case scenario, but the SDK will hold onto the receipt and keep trying to post it to the server. Once the server is reachable, the receipt will post and entitlement will be granted retroactively.
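The fail-safes above can be sketched roughly as a local entitlement cache plus a pending-receipt queue that keeps retrying until the server is reachable again. This is a minimal illustration in Python; the class and method names are invented for this sketch and are not the actual RevenueCat SDK API:

```python
class EntitlementCache:
    """Minimal sketch of the SDK's offline fail-safes.

    Names are illustrative, not the actual RevenueCat SDK API.
    """

    def __init__(self):
        self.cached_entitlements = None  # last successful backend response
        self.pending_receipts = []       # receipts awaiting verification

    def on_backend_response(self, entitlements):
        # Cache the most recent entitlement response for offline use.
        self.cached_entitlements = set(entitlements)

    def is_entitled(self, entitlement_id):
        # While the API is unreachable, fall back to the cached response;
        # an entitlement is never revoked just because the cache is stale.
        if self.cached_entitlements is None:
            return False
        return entitlement_id in self.cached_entitlements

    def record_purchase(self, receipt):
        # The store transaction succeeded but the backend is unreachable:
        # hold onto the receipt so it can be posted later.
        self.pending_receipts.append(receipt)

    def flush(self, post_receipt):
        # Called when connectivity returns; the backend then grants
        # entitlements retroactively for any held receipts.
        still_pending = []
        for receipt in self.pending_receipts:
            try:
                post_receipt(receipt)
            except ConnectionError:
                still_pending.append(receipt)  # retry on the next flush
        self.pending_receipts = still_pending
```

The key property is that a failed post never drops the receipt; it stays queued until a flush succeeds.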
For most apps on most days, this significantly mitigates the impact of downtime. The cached-Offerings case, however, is one we are working on. We have several ideas, such as a special “party mode” for the SDK: when we flip a flag at our edge layer, the SDK defers verification of purchases until the service is back up and running. This requires careful consideration, though, as it makes our SDK easier to exploit. It may be something we address alongside our SDK security update later this year. For the no-cached-Offerings case, most of our Offerings information is kept at the edge layer as well, so in the event of database issues we could continue to serve those requests. Without a solution to the deferred-verification problem, however, this would be a net degradation of the experience. Improving these failure modes is on our roadmap for 2023, and on our short-term roadmap are improvements to both scenarios that smartly load-shed less important requests first to minimize the impact further.
The other major consideration during downtime is what happens to events, data pipelines, the dashboard, etc. For us, these fall into a second tier: while we know you all love refreshing that MRR dashboard, these services are less mission-critical. This doesn’t mean we don’t care, but when we are choosing how to invest our limited reliability tokens, we invest appropriately. What happens to these services really depends on the nature of the incident, but let me talk about the case where our database is unavailable:
- We stop ingesting new data. If the API can’t ingest receipts, we can’t track new subscriptions. However, when the database has fallen over, you typically can’t read the dashboard anyway, so who cares. In some cases we can recover receipts we failed to ingest from our logs, but this isn’t always possible. The SDKs, however, do a good job of caching and retransmitting receipts until they are delivered.
- Dashboard inaccessible. Our SDK and Dashboard both rely on the same central service, so the dashboard is typically inaccessible when we have an issue, though no data is lost.
- Events are delayed, but not lost. We will continue generating events, and they will be dispatched with a delay. This typically has minimal impact for 3rd party integrations, but depending on your usage, webhooks being delayed can cause issues. Make sure your webhook consumption logic knows what to do when a webhook doesn’t arrive on time.
- Charts, Customer Lists, and ETL data. These may experience some delay, but it is usually minimal, if there is any at all.
- Server to Server – If you are calling RevenueCat APIs from your backend directly, you will want to make sure you have good error handling for situations where the API may be unreachable. This isn’t the case for most users.
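The webhook advice above can be made concrete with a small sketch of a delay-tolerant consumer. This is illustrative Python, not an official integration; the payload field names here are assumptions, so check the webhook documentation for the real shapes. The two ideas are deduplicating on the event id (retries after an incident can deliver an event twice) and ordering by the event’s own timestamp rather than its arrival time:

```python
class WebhookConsumer:
    """Sketch of a webhook consumer that tolerates delayed, out-of-order,
    and duplicated events. Field names are illustrative assumptions."""

    def __init__(self):
        self.seen_ids = set()  # processed event ids, for idempotency
        self.state = {}        # app_user_id -> (event_timestamp_ms, type)

    def handle(self, event):
        # 1. Deduplicate: delivery retries can send the same event twice.
        if event["id"] in self.seen_ids:
            return False
        self.seen_ids.add(event["id"])

        # 2. Order by the event's own timestamp, not arrival time:
        # a delayed event may arrive after a newer one and must not
        # clobber the newer state.
        user = event["app_user_id"]
        prev = self.state.get(user)
        if prev is None or event["event_timestamp_ms"] >= prev[0]:
            self.state[user] = (event["event_timestamp_ms"], event["type"])
        return True
```

In a real consumer the `seen_ids` and `state` stores would be durable (a database, not in-memory sets), but the dedup-then-order logic is the same.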
We’ve designed these downstream systems to degrade gracefully in most cases; because we control them more directly, it’s easier for us to build in mitigations. However, there is more we can do here. One of our more severe incidents in the last year was due to an unexpected component failure in our data pipelines (love you, Jeff). We are in the process of migrating away from our current technology to a new method of streaming data from our primary database that we believe will be more reliable.
A final category of mitigation is what happens when the stores go down. It happens. Right now we can react to these manually using our promotional entitlement system; however, this isn’t ideal, and we haven’t had to use it in a while. On our roadmap are improvements to this system so that an app store outage has limited impact on apps.
One important thing to note here is that much of our mitigation strategy is implemented in our client SDKs. If you are using our API directly, you should be thinking about basic best practices around reachability. The internet just doesn’t work sometimes; sometimes it’s us, sometimes somebody just unplugged a cord. You will want to think about “what happens if this fails?” any time you are making a network call to RevenueCat. Likewise, with any event-based infrastructure, ask yourself “what happens if this event is delayed?”
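For direct API callers, the “what happens if this fails?” question often reduces to retrying transient failures with exponential backoff and jitter. A minimal sketch, in Python; `request_fn` is a placeholder for whatever HTTP call your backend makes to RevenueCat, and the parameter values are arbitrary defaults, not recommendations:

```python
import random
import time


def call_with_retries(request_fn, max_attempts=5, base_delay=0.5,
                      sleep=time.sleep):
    """Retry a network call with exponential backoff and jitter.

    `request_fn` stands in for any call to the RevenueCat API; only
    connection-level failures are retried here.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Back off exponentially, with jitter so that many clients
            # retrying at once don't stampede the recovering server.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            sleep(delay)
```

A real implementation would also retry on HTTP 5xx responses and cap the maximum delay, but the shape is the same.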
At the end of the day, I think we’ll build a more reliable IAP service than anybody else can. I don’t mean that as a dig at anyone; it’s just that computers are hard. It’s a complicated, multi-party distributed system, and stuff happens. We’ll never be perfect, but it’s my opinion that we’re investing well, and I believe our overall reliability will continue to improve.
I trust me; I hope you will too.