Postmortem for Aurora Postgres Migration November 23 2022

tl;dr

On November 23rd, we migrated from our existing AWS RDS cluster running on Aurora Postgres 10.x to a new cluster running Aurora Postgres 14.x. Once we switched over to the new cluster, performance degradation caused most of our backend systems to fail. This was mainly caused by the new cluster not being analyzed, causing the query planner to become incredibly inefficient. It took us about three hours to find the root cause, and while we were able to recover most of the service before reaching the solution, it took longer than we’d have liked and during that time, affected user experience.

For users of applications using the SDK with existing purchases, everything functioned as normal. For new purchases, while succeeding on the stores, the unlocking of entitlements was not working or spotty during a 2h period. For affected users, reopening the app after the service disruption should lead to reconciliation, without the need for additional user actions.

Background

On February, 2022 we received a notice that AWS is planning on ending 10.x support for Aurora PostgresSQL on January 31st, 2023. This means that, before that date, we needed to upgrade our main database cluster.

In the past, we’ve successfully completed two of these major upgrades and a handful of minor upgrades. For this upgrade, we decided on a different approach, based on DB replication. We planned to restore a snapshot on a new cluster, upgrade it to Aurora Postgres 14.x, and then replicate changes from the production cluster. Once replication was caught up, we would briefly switch the old database to be read-only and then failover to the new cluster after the last changes were replicated (<10s).

We chose this approach to ensure data consistency, as replication at the DB level guarantees higher levels of data correctness than replicating writes at the application level. App-level replication makes db state diverge and makes it harder to guarantee and verify data integrity. The size of – and traffic on – our DBs have also increased massively since the previous upgrades, making replication at the app-level more challenging and risky than before.

Since this is a complex (44 steps) and dangerous operation, over two months were spent by the team preparing for the migration and performing tests and many dry-runs. After testing, we concluded that we would end up with less than 10s of full and 1-2 minutes of partial downtime. During that short window of time, requests that write to the database would fail (reads unaffected). With SDK retries this should be almost unnoticeable.

During the 1-2 minute transition of traffic from old to new clusters, users might read data from the old cluster while some write is performed on the new, leading to small inconsistencies. We tuned our cache to use small TTLs so no inconsistencies were persisted on cache after the transition.

Impact

Dashboard

The RevenueCat dashboard was inaccessible from 9:39 UTC till about 13:07 UTC. Around ~11:30 UTC about half of the requests were going through, ramping up to 100% at 13:07 UTC.

Apps Using the RevenueCat SDK

The endpoints used by the RevenueCat SDK were experiencing issues from 9:39 UTC till about 13:07 UTC. Around ~11:30 UTC about half of the requests were going through, ramping up to 100% at 13:07 UTC.

Existing Customers

For the vast majority of existing customers in apps using RevenueCat, entitlements would still remain unlocked as this data is cached in the app. Even when a request fails, the cache is used (to account for things like outages and airplane mode).

For customers that had an expired entitlement (for example, your billing cycle might end a week ago and you haven’t opened the app since then) they would have lost access during the outage period.

New Customers

For new customers attempting to purchase subscriptions, the purchase from Apple/Google/Amazon/Stripe would succeed, but their entitlements would not unlock properly or required several attemps. The SDK would persist the receipt data and retry to send it each time the app is foregrounded. If the app was closed the next time it is loaded it resends the receipt and unlock the entitlements. No specific user actions are required to reconcile purchases done during the time.

Integrations

Integration event dispatching was not stopped in the original switch over to PG14. This meant for a brief period of a few minutes while PG 10 was in read only, the dispatcher job would dispatch integration events, but would fail to update the events table. This was caught a few minutes into the switchover and the dispatcher was manually stopped.

This resulted in events still being dispatched, but failing to update their status in the database.

For the webhooks that were dispatched but not recorded in our database we replayed out of abundance of caution, since some failures to deliver might not have been written to our DBs for redelivery.

Other integrations have not been replayed. These integration events (including webhooks) were likely delivered, but weren’t recorded on our end. For a lot of these integrations, there’s a risk that replaying them could lead to damage or inconvenience (for example, sending Slack messages for old purchases that already have slack messages, etc).

If there are specific cases where you would like to have integration events replayed that haven’t been replayed yet, we can do this for you on a case by case basis.

Root Cause

Once we switched over to the new cluster we immediately suffered very serious performance degradation. The root cause of the performance degradation was the new database cluster. High throughput queries become incredibly inefficient because ANALYZE was not run on our largest tables. When preparing for this migration, we did compare query performance and query execution plans between our 10.x cluster and the test 14.x cluster. However, we didn’t notice this problem since during our performance analysis the 14.x cluster had been up for enough time to finish vacuuming tables and thus gathering the required stats for the query planner. During the migration day the vacuum for one of the largest tables didn’t finish before the cutover, and this was the cause of the performance degradation and associated cascading failures.

Cascading Failures / Problems

While the reasons above are the most impactful causes, since RevenueCat is a fairly complex system, these issues resulted in cascading failures and problems which lead to a delicate recovery process.

Unclear source of the problem

The biggest challenge during the event was how hard it was to identify the problem. The DB performance issues caused cascading failures in all the systems and it took us about 1 hour to discern between cause and effect.

During the traffic shift to the new web tier we started seeing DB performance issues. We did our best to warm up the new database cache by reading all records added in the last 24h in the critical tables, but we still initially attributed the issue to the cluster being cold.

DB was able to warm up and spend less time on IO, and then our focus shifted to assess the health of our pg pool cluster that looked totally saturated. The slow queries were causing a lot of contention on pg bouncers waiting for a free connection, but it wasn’t clear if it was pg bouncer or DB slowness.

We decided to shed load arriving at the cluster and open traffic slowly. This helped a lot allowing us to better see the source of the problem as our services were able to recover.

Eventually we found one of the queries that was extremely bad performing. The query was using the wrong index, but it wasn’t clear this was due to lack of analyze, it could have been a different choice of index by postgres 14 query planner. We proceed to drop the index that was being wrongly used since it was not needed for other use cases. This fixed that particular query and at this point we were able to sustain the load with higher latencies than usual.

Another query showed that postgres wasn’t able to choose the obvious index even in very simple cases, and pointed to the lack of stats on the table as the source of issues. The analyze took quite long to finish while we were battling the system to keep it working under the load.

Rollback

Rollback was considered, but the rollback procedure for this operation after the critical transition was pretty complex, since it involved reversing the replication to sync writes from the new cluster into the old one, and shifting traffic again to the old web tier. During that time we would not be accepting writes as well. During the incident it wasn’t initially clear whether the performance issues were due to shifting the load quickly and whether a rollback would have caused the same load issues on the old db cluster if we decided to rollback. By the time we understood the issue, the service was operating reasonably well with all the mitigations deployed, and we decided to move forward with the fix. In retrospective, rollback at the beginning would have been a better choice.

Incoming Apple Webhook Requests

Once we started accepting traffic again, we started receiving incoming Apple webhook requests. These requests had a particularly slow query due to the query planning issues described above, and caused overall recovery to suffer again. Once we temporarily stopped accepting Apple webhooks, we were able to recover much more quickly.

Additional Query Load Due to Third Party Application Performance Monitoring Agent EXPLAIN Queries

Our new cluster ended up with long-running EXPLAIN queries for nearly every query. This was because a third-party application performance monitoring (APM) tool we use will run EXPLAIN on slow queries. Since many of our queries became slow, this resulted in a lot of EXPLAIN queries running that also ate up database resources. To work around this, we rolled out a new release without New Relic enabled as we have many other dashboards to provide visibility (primarily AWS CloudWatch and our internal Prometheus/Grafana setup).

Duplicate Primary Keys

For a handful of large tables, we ended up with primary key sequences running into their values violating unique constraints for rows already in the tables. We did not realize that the state of sequences would not be copied to the databases. Official documentation describes the limitations of sequence replication. Also, this blog post describes it even more clearly:

“The values of the table will be copied to the subscriber of course, but the state of sequence will not.”

So, the main two errors were:

Missing sequence replication limitations: Assume that the state was replicated as well.
Lack of testing writes: We should have replayed writes during our dry-run tests: stop replication and do normal writes to mimic the cutover more closely. We assumed that since writes are simpler vs reads on dbs, writes did not require the same amount of testing and we failed to better assess the risks.

We were able to mitigate this issue quickly by automatically increasing the value for the impacted sequences, thanks to a previously prepared script.

Detailed Timeline

This timeline covers events from when the actual migration started, excluding prep work. All times are in UTC on November 23rd, 2022. During the incident, we kept a running text log of when things took place. Some items have estimated times (marked with ~).

Time (UTC)	Description
2:51 UTC	New cluster is created via snapshot, replication of new changes from old cluster to new cluster begins.
8:51 UTC	Replication has caught up and all prep work is done.
9:34 UTC	Old cluster is switched to read only. Read-only requests still work, requests with writes are expected to fail for less than 2 minutes.
9:34 UTC	Replication is fully caught up changes since switching to read-only within 10s. We are now ready to switch traffic to use the new cluster. Additionally, integration event dispatching, ETLs, and other non-web server jobs will be stopped until we switch over.
9:34 UTC	Start of outage. Traffic begins to switch to the new cluster, writes begin to flow in the new cluster.
9:39 UTC	100% of traffic is going to the new cluster. Latency is not looking good. Infrastructure team begins to investigate.
9:54 UTC	We are seeing statements take much longer than normal and using all db connections. We add statement timeout to the db user to ensure queries do not use a db connection for too long.
9:58 UTC	Things are still not looking good. We begin to shed the majority of traffic via sending 503s from the load balancer to 90% of traffic, to leave the system enough headroom to recover. This also helps all systems recover and we can see db slowness as the source of issues.
~10:00 to 10:30 UTC	After running EXPLAIN on some of the long-running queries, we realize the query plans are incredibly inefficient. Some indexes being used are doing more harm than good and aren’t needed. We begin to drop these indexes to hopefully alleviate issues.
~10:15 to 10:30 UTC	Index tweaks are showing promise. We’re able to accept 50% of traffic with elevated response times.
~10:30 to 13:07 UTC	End of outage (except for integration events). We’re able to gradually get back to 100% of traffic (with elevated latency). To do this, we have to temporarily stop accepting webhooks from Apple (our offline receipt refreshers will eventually get to these receipts).
~12:00 to 13:07 UTC	After running through many query plans, it finally becomes apparent that ANALYZE did not run on the new cluster. ANALYZE is run on our biggest tables manually and database performance drastically improves. We’re now confident that we can bring everything back online.
12:11 UTC	Separate Slack channel to field out support + social comms created.
13:33 UTC	Everything else is restored (integration event dispatching, ETLs, and other non-web server jobs).
14:51 UTC	Email sent out to all customers notifying them of the outage.

Action Items and Learnings

Among the many internal action items, we want to call out the following:

Timing and communication:

Timing-wise, we chose to complete this before Thanksgiving and Black Friday which we knew are important for business. In retrospective we should have delayed until the week after, as we’d entered a sensitive period in the US and part of our US support team was already out on holidays. Going forward, we want to consider implementing freezes during critical times.
We need to better communicate the reliability of SDK handling during backend outages. There was a lot of confusion and anxiety over the impact of the outage on applications.
Improve our communication strategy to better inform customers of maintenance events and outages (for example in the developers dashboard and via direct email to technical stakeholders). We’d kept the status page up to date with both the maintenance and following issue, and even though this page offers subscriptions, these are not easily discoverable.

Prevention:

The dry-runs did not replay write traffic on the restored and updated cluster, they just focused on data consistency and query performance. This wasn’t done as we thought that writes are simpler than reads (no indexes involved), obviously a bad decision in retrospective: We could have found the issues replaying a subset of traffic to more closely mimic the 0-day transition.
Invest (even) more into moving costly processes to be as asynchronous as possible to keep api fast and stable.
Re-evaluate table indexes, we have indexes that are not used and can cause degradation if the database chooses to use it for running queries.

Handling and escalation:

Improve how we surface the load of different endpoints and systems in our DB dashboards, so we can quickly identify where the load is coming from. It took us too long to identify that the apple webhooks load was impacting our recovery efforts.
Involve support and customer communications during outages to handle communications.

All in all, we’re deeply sorry for the inconvenience and anxiety this caused RevenueCat users. This was a learning experience, and we promise to do better in future.

Rest to express my thanks to the team that worked tirelessly before, during, and after the incident. Without their contributions, this post mortem and its follow ups wouldn’t have been possible.