How we migrated our database to Amazon Aurora with zero downtime
Follow along with our zero-downtime migration.
We recently migrated our main production PostgreSQL database from Amazon RDS to Aurora. Though moving to Aurora is a relatively simple process, most of the approaches that came up in our research involved seconds to minutes of downtime.
At RevenueCat we cannot afford downtime since we provide critical infrastructure for in-app purchases. At the time of the migration we were processing over 600 requests per second. Not being accessible for several minutes would be considered a serious outage.
When we launched RevenueCat last year, our production database was running on hardware comparable to a Raspberry Pi. As we grew, we realized our database would become the main bottleneck of our architecture and a potential single point of failure. While we were in Y Combinator, we started getting some traction and we upgraded to a much more powerful RDS database. We also created multiple read replicas to power our dashboard and offline jobs.
As we kept scaling our systems, the limitations of RDS started to become obvious. Our main reasons for switching were:
In case of a disaster, RDS failover (multi-AZ) is DNS-based. Though automatic, the whole process can take several minutes, potentially leading to thousands of dropped requests.
Aurora, however, replicates 6 copies of your database across 3 different availability zones, backing up your data continuously. Thanks to this architecture, failovers can be performed in less than 30 seconds, which greatly reduces the chance of downtime.
Performance and scalability
RevenueCat is a high-throughput system. We are constantly writing to the database, and our API serves over 40K requests per minute. One of our main problems with RDS was the IOPS (I/O operations per second) limit. IOPS had to be provisioned beforehand, which forced us to keep the system over-provisioned at all times. Engineers needed to keep a constant eye on our read and write throughput, because a big jump in API requests per second could degrade our response time significantly. We were still far from our IOPS limit, but we knew it would become a challenging problem at the rate we were growing. Aurora, on the other hand, handles I/O automatically, without the need to provision IOPS manually.
Another big advantage of Aurora’s architecture is the ability to add up to 15 read replicas with extremely low latency, which we can use to distribute our read operations.
Timing was just right
As mentioned before, the engineering team was well aware of the bottlenecks of the system, but we still had plenty of time to make the change. The right time to do a big infrastructure upgrade is before you need it. These types of migrations can be stressful and catastrophic when performed under pressure. We wanted to make sure we understood all the implications well, researched alternatives, tested properly and even designed a mitigation plan. In an emergency situation, you lose those luxuries.
The typical solution for migrating to Aurora, which involves downtime, looks like:
1. Create an Aurora read replica from your RDS database.
2. Wait until Aurora has caught up with your RDS database.
3. Promote your Aurora read replica to master, so that it can accept writes.
4. Change your production database URL in your apps to your brand new Aurora read/write instance.
The problem here is step number 3. This promotion creates a fork between your databases. The promotion of the Aurora follower takes approximately 3-5 minutes and requires a database reboot. During this time, the data written to your old RDS database will not be reflected in the new Aurora database. This is a particularly tricky problem when you are dealing with relational data and foreign keys. If your database uses auto-increment primary keys, and you are trying to re-insert the missing data at a later stage, it could lead to dangerous discrepancies.
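To make the auto-increment problem concrete, here is a minimal sketch that uses two in-memory SQLite databases as stand-ins for the old RDS instance and the promoted Aurora instance. The table and column names are hypothetical, and SQLite is only an illustration; PostgreSQL sequences behave differently in detail, but the collision shown here is the same kind of discrepancy.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with an auto-increment primary key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, app_user_id TEXT)"
    )
    return conn


def insert(conn: sqlite3.Connection, app_user_id: str) -> int:
    """Insert one row and return the id the database assigned to it."""
    cur = conn.execute(
        "INSERT INTO purchases (app_user_id) VALUES (?)", (app_user_id,)
    )
    conn.commit()
    return cur.lastrowid


old_db = make_db()  # stand-in for the original RDS database
new_db = make_db()  # stand-in for the Aurora replica

# Before the fork, replication keeps both sides identical.
for user in ("alice", "bob"):
    insert(old_db, user)
    insert(new_db, user)

# After the fork, each side accepts its own writes independently.
old_id = insert(old_db, "carol")  # write that still reached the old database
new_id = insert(new_db, "dave")   # write that reached the promoted database

# Both sides handed out the same id for different rows.
collision = old_id == new_id
```

Copying the "carol" row over with its original id would now clash with the "dave" row, and any foreign key pointing at that id would refer to a different purchase depending on which database you ask.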
The naive, often suggested solution for this problem is to put the API under maintenance, not accepting any write calls, while the read replica is being promoted. This was not an option for us.
Another solution would be to set up an asynchronous multi-master replication database cluster. Unfortunately, this is not easily achievable in PostgreSQL, and would have required additional engineering and the use of third-party tools like Bucardo.
Instead, we decided to exploit a huge advantage of our system: all of our SDK endpoints are idempotent. This means that even if we send the same POST request to our API multiple times, the result will always be the same, without duplications or data inconsistencies. This allowed us to log any requests sent to the old database after the promotion of Aurora, then re-send them to the Aurora instance later.
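As an illustration of what idempotency buys you here, the following sketch shows one common way to get that property: keying writes on a natural identifier instead of a counter, so a replayed request overwrites the row with identical data. This is an assumption for illustration, not a description of our actual endpoints; `ReceiptStore`, `post_receipt` and the field names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ReceiptStore:
    """Toy stand-in for the database: rows keyed by a natural key."""

    rows: Dict[Tuple[str, str], dict] = field(default_factory=dict)

    def post_receipt(
        self, app_user_id: str, transaction_id: str, product_id: str
    ) -> dict:
        key = (app_user_id, transaction_id)
        # Upsert: writing the same key again stores the same data,
        # so repeating the request never creates a second row.
        self.rows[key] = {"product_id": product_id}
        return self.rows[key]


store = ReceiptStore()
request = {
    "app_user_id": "user_1",
    "transaction_id": "tx_42",
    "product_id": "monthly",
}

first = store.post_receipt(**request)
state_after_first = dict(store.rows)

second = store.post_receipt(**request)  # the replayed request
state_after_second = dict(store.rows)
```

The state after the second call is identical to the state after the first, which is exactly what makes it safe to replay requests that may already have been replicated.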
The migration plan
1. Create an Aurora read replica from our production RDS database, and set up CloudWatch alarms and dashboards.
2. When the read replica is all caught up:
   2.1. Pause all the asynchronous jobs that could be writing to the main database.
   2.2. Activate a flag in production to start logging all POST requests (URL, headers, body) into a queue. This captures the subset of write requests that happen right before and after the fork. Some of them will also be replicated to Aurora, but that is totally fine since our endpoints are idempotent.
3. Promote the Aurora replica. This step stops replication to allow writes, creating the fork. When fully promoted:
   3.1. Create read replicas from the Aurora master for asynchronous jobs, the dashboard, etc.
   3.2. Duplicate our API web infrastructure, pointing it to Aurora instead of our old database. Re-issue all the requests in the queue to this new infrastructure, which essentially replicates all the missing writes in Aurora. We used a script similar to cgarcia’s Making an Unlimited Number of Requests with Python aiohttp + pypeln.
4. Up until this point, we have not made any real change to production, and everything is reversible. Once Aurora is caught up, we can just change the database URL of our production environment. We use Heroku, so we can make this change with no downtime. All the writes happening right before this switch are still being logged to the queue, so we will not miss them.
5. Start all the asynchronous jobs again.
6. Monitor everything: server performance, database, code exceptions, etc.
7. Deactivate the logging flag, which stops writing requests to the queue. Turn off the duplicated infrastructure created in 3.2.
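The record-and-replay part of the plan can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the script we ran: it uses plain `asyncio` instead of aiohttp + pypeln, an in-memory deque instead of a durable queue, and a hypothetical `send` coroutine where a real setup would POST to the duplicated, Aurora-backed API. The names `RequestRecorder`, `replay`, `fake_send` and the example paths are made up for this sketch.

```python
import asyncio
from collections import deque
from dataclasses import dataclass
from typing import Awaitable, Callable, Deque, Dict, List


@dataclass
class RecordedRequest:
    url: str
    headers: Dict[str, str]
    body: bytes


class RequestRecorder:
    """In-memory stand-in for the queue; production needs a durable one."""

    def __init__(self) -> None:
        self.enabled = False  # the production "logging flag"
        self.queue: Deque[RecordedRequest] = deque()

    def maybe_record(
        self, method: str, url: str, headers: Dict[str, str], body: bytes
    ) -> None:
        # Only POST requests are logged, and only while the flag is on.
        if self.enabled and method.upper() == "POST":
            self.queue.append(RecordedRequest(url, dict(headers), body))


Sender = Callable[[RecordedRequest], Awaitable[int]]


async def replay(
    recorder: RequestRecorder, send: Sender, concurrency: int = 10
) -> List[int]:
    """Drain the queue and re-issue every request with bounded concurrency."""
    limit = asyncio.Semaphore(concurrency)

    async def send_one(req: RecordedRequest) -> int:
        async with limit:
            return await send(req)

    drained = []
    while recorder.queue:
        drained.append(recorder.queue.popleft())
    # gather returns results in the same order as the drained requests.
    return await asyncio.gather(*(send_one(req) for req in drained))


async def fake_send(req: RecordedRequest) -> int:
    # Hypothetical sender: pretends the Aurora-backed API answered 200.
    await asyncio.sleep(0)
    return 200


recorder = RequestRecorder()
recorder.maybe_record("POST", "/v1/receipts", {}, b"{}")  # flag off: ignored
recorder.enabled = True
recorder.maybe_record("POST", "/v1/receipts", {"X-Platform": "ios"}, b'{"a": 1}')
recorder.maybe_record("GET", "/v1/subscribers/user_1", {}, b"")  # read: ignored

statuses = asyncio.run(replay(recorder, fake_send))
```

Concurrent replay can deliver requests in a different order than they were recorded. Idempotency protects against duplicates, not against reordering, so if the order of writes for the same user matters, passing `concurrency=1` keeps the original sequence at the cost of speed.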
The day of the migration
We chose a moment of low traffic, and had the whole team ready for support. We did several dry runs to make sure all the scripts were working as expected and there was no room for human error. I cannot emphasize enough how important these dry runs are. The same way you would not deploy a new product feature without testing it, it seems like a bad idea to test your migration scripts directly with your production environment. Dry runs will act as integration tests and will allow you to detect flaws in your migration plan before it is too late.
We also made sure that the rollback plan was fully tested, in case we had to abort for any reason.
Even though everything went incredibly smoothly, and the whole operation was performed in about 5 minutes, having the team ready to help in case of an apocalyptic outcome felt like an enormous safety net. These types of critical changes tend to be quite stressful for an engineering team, but having a culture in which everybody is willing to help any way they can really makes a difference.
- Zero-downtime migrations are not as straightforward as you would expect in the era of cloud services, even when they happen inside the same cloud provider.
- Be opportunistic. Do it when you can do it, before you must do it.
- Understand your system and exploit its particularities.
- Dry run, dry run, dry run.
I hope this post helps you navigate your own zero-downtime migration. It was a fun problem to solve with a lot of learnings for our engineering team. Do you enjoy these types of problems? Would you have done it differently? We would love to hear from you!