An end-to-end solution for running price experiments.
July 30, 2020
Running A/B tests is hard. To get it right, you have to effectively manage user cohorting, data collection, and data analysis. Add in the complexity introduced by subscriptions, and it quickly becomes a problem that most developers either won’t tackle or, worse, will tackle incorrectly.
Our mission at RevenueCat is to help developers make more money, and pricing is one of the most effective levers for increasing revenue. While they’re difficult to do well, price experiments are the best way to measure the impact of pricing changes. We love making hard things easy for our users, so we decided to tackle this problem head-on.
We’ve spent the last year developing an LTV model that seamlessly incorporates your app’s data with heaps of aggregate subscription metrics, giving you a powerful data tool not available anywhere else.
If you want to skip the science talk and just learn how to use it, jump to the end.
The Math of A/B Tests
A typical A/B test goes something like this: split users randomly into two cohorts, show each cohort a different version of the app, then measure how many users from each group performed some action. This action might be clicking a button, signing up, or sharing to Twitter.
Most A/B significance calculators you find online assume a situation along these lines. In statistics land, we call this a binomial experiment — a trial is either successful or it’s not. In this case, success is defined as taking the desired action.
The math is a little complex, but after running a binomial experiment, you can then determine whether the measured success rate for one group is higher than the success rate of the other — as well as the likelihood that it’s more than just a statistical fluke. This likelihood is called significance, and it is at the heart of why experiments are difficult to do well.
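As a rough illustration of the binomial case, here is a minimal sketch of a two-proportion z-test, one common way online calculators compare two success rates. The function name and the example counts are made up for illustration, and the normal approximation it relies on assumes reasonably large cohorts. A small p-value corresponds to what this post calls significance.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two_sided_p_value) for the difference in success rates."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both cohorts behave the same.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 100 of 1000 users acted in cohort A, 130 of 1000 in cohort B.
z, p = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these made-up numbers the difference would pass the conventional 5% threshold, but only barely, which is a reminder of how much data even a simple binomial test needs.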
Measuring Success in Price Experiments
A binomial experiment is a great way to test many things in your app. However, not all measures of success are as simple as a user taking an action.
Price experiments are a great example of this. Measuring how many users purchase something is binomial (they either purchase it or they don’t), but what about when the price changes? In this case, the actions of each cohort are different, and the binomial experiments cannot be directly compared. Add in the complexity of the users’ subscriptions (did they start a trial, did they convert to a paid plan, and how many times did they renew?), and measuring success becomes very complicated very quickly.
The nuances of app subscriptions make price experiments extremely difficult to pull off. We decided to build a model for subscription apps to make price experimentation easy.
Modeling Lifetime Value
The main goal of a price experiment is to determine which cohort of users will generate more revenue over their lifetime. For modeling LTV, we use the following formula:
The LTV of a cohort is equal to the sum for all products of the trial start rate for that product (c), times the trial conversion rate for that product (τ), the average number of renewals for that product (μ), and the price of that product (p).
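Written as an equation, the description above would look roughly like this. The product index i is notation added here for clarity; the symbols c, τ, μ, and p are the ones defined in the text.

```latex
\mathrm{LTV} = \sum_{i \,\in\, \text{products}} c_i \, \tau_i \, \mu_i \, p_i
```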
For example, say that in a cohort of 1,000 users, 400 start a trial (c = 400/1000 = 0.4), half of those trials convert (τ = 200/400 = 0.5), each converted user renews an average of 3.6 times before canceling (μ = 3.6), and the price of the product (p) is $4.99. Put it all together, and 0.4 * 0.5 * 3.6 * 4.99 gives us an LTV of about $3.59 per user in the cohort.
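The same arithmetic can be sketched in a few lines of Python. The function and dictionary keys are hypothetical names chosen for this example, not part of any RevenueCat API; the numbers are the ones from the example above.

```python
def cohort_ltv(products):
    """Sum trial start rate * trial conversion rate * average renewals * price
    over every product offered to the cohort."""
    return sum(
        p["trial_start_rate"]
        * p["trial_conversion_rate"]
        * p["avg_renewals"]
        * p["price"]
        for p in products
    )


# One product, using the figures from the worked example.
example_products = [
    {
        "trial_start_rate": 400 / 1000,       # c
        "trial_conversion_rate": 200 / 400,   # tau
        "avg_renewals": 3.6,                  # mu
        "price": 4.99,                        # p
    }
]

ltv = cohort_ltv(example_products)
print(f"${ltv:.2f}")  # $3.59
```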
Why is it useful to simplify LTV like this? Our end goal is to not only estimate LTV, but also to help developers understand the uncertainty of that estimation. By breaking the process down into semi-independent sub-processes, we can begin to model different user behaviors individually.
For instance, determining which product users choose from an offering looks a lot like a multinomial distribution. Whether or not users convert from a free trial to a paid plan looks a lot like a binomial distribution, like the traditional A/B tests we discussed above. This just leaves modeling the average number of renewals, which is a bit more nuanced, but can be modeled using survival analysis models (more on this in a moment).
Using Our Model
For our model to be useful, we have to be able to use it to produce predictions. In our case, the prediction is: given this cohort of users, how many trials they’ve started, how many trials have converted, and how often they are renewing, what do we think their lifetime value will be after 1, 3, or 6 months? We also need to be able to estimate the uncertainty of those predictions in order to tell developers how confident the predictions are.
To do this, we need to “fit” the parameters of the model (c, τ, μ, etc.) to the data we’ve collected. It’s easy for us to pull the data together (it’s all in RevenueCat!), but how do we go about fitting the model? There are a few problems we need to consider:
Some products can take a year to produce renewal information; how do we produce useful predictions sooner?
How do we handle propagating the uncertainty in different parts of the model?
How do we incorporate churns and renewals into the survival model without introducing biases?
Bayes and the Power of Priors
Because RevenueCat has lots of data about subscription performance for many different products and apps, we can make some broad assumptions about things like trial start and conversion rates, renewal rates, etc. These prior assumptions, or priors, allow us to make some basic estimates in the absence of data.
For instance, it is very unlikely that your app will have a 99% trial conversion rate. It’s also unlikely that it will have a 1% trial conversion rate — not impossible, but unlikely. In fact, looking at all the trial conversion rates across our platform, we can make some general estimates of what a trial conversion rate might look like.
This allows us to make performance predictions for things that are difficult to collect data on like annual renewal rates.
The process of combining prior assumptions with new data is called Bayesian inference, and it is how we are able to seamlessly blend data about your specific app with our prior estimates.
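To make the idea concrete, here is a minimal sketch of Bayesian updating for a single trial conversion rate, using the textbook Beta-binomial conjugate update. This is an illustration of the general technique rather than the model described in this post, and the prior Beta(8, 12) is an invented stand-in for "platform-wide conversion rates tend to sit around 40%".

```python
def update_beta(prior_alpha, prior_beta, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return prior_alpha + successes, prior_beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior: mean 0.4, worth roughly 20 observed trials.
prior = (8.0, 12.0)

# A brand-new app: 5 trials, 3 converted (60% observed).
small = update_beta(*prior, successes=3, failures=2)

# The same app later: 500 trials, 300 converted (still 60% observed).
large = update_beta(*prior, successes=300, failures=200)

print(f"prior mean:        {beta_mean(*prior):.3f}")
print(f"after 5 trials:    {beta_mean(*small):.3f}")
print(f"after 500 trials:  {beta_mean(*large):.3f}")
```

With only a handful of trials the estimate stays close to the prior; as data accumulates, the app's own numbers dominate.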
Error Propagation with Markov Chain Monte Carlo
With the Bayesian technique, we can tackle the problem of error propagation. Traditionally, you'd look in a big book to find the equation for the confidence interval of your distribution. This works if your equation is simple (and included in the big statistics book). However, if you have a more complicated model like ours, writing down the confidence intervals in a closed form becomes difficult or impossible.
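A small simulation shows why sampling sidesteps the closed-form problem. The sketch below draws the trial start rate and trial conversion rate from Beta distributions, multiplies each draw through the LTV formula, and reads an interval off the sorted results. It makes simplifying assumptions that the real model does not: flat Beta(1, 1) priors, a single product, and an average renewal count treated as exactly known.

```python
import random


def simulate_ltv_interval(n_users, n_trials, n_conversions,
                          avg_renewals, price, draws=10000, seed=0):
    """Return (2.5th percentile, median, 97.5th percentile) of simulated LTV."""
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        # Beta(1 + successes, 1 + failures) is the posterior under a flat prior.
        c = rng.betavariate(1 + n_trials, 1 + n_users - n_trials)
        tau = rng.betavariate(1 + n_conversions, 1 + n_trials - n_conversions)
        # Uncertainty in c and tau flows into LTV simply by multiplying draws.
        samples.append(c * tau * avg_renewals * price)
    samples.sort()
    return (samples[int(0.025 * draws)],
            samples[draws // 2],
            samples[int(0.975 * draws)])


# Counts from the LTV example: 1000 users, 400 trials, 200 conversions.
low, median, high = simulate_ltv_interval(1000, 400, 200, 3.6, 4.99)
print(f"LTV is roughly ${median:.2f} (95% interval ${low:.2f} to ${high:.2f})")
```

No interval formula for the product of two rates was needed; the spread of the simulated values is the error estimate.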
To get error estimates, we have to use simulation techniques. Instead of writing down an equation, we draw thousands of candidate parameter values, favoring those that are more likely to have produced the data we’ve observed. In the absence of data, the samples simply come from our priors. This technique is called the Markov Chain Monte Carlo (MCMC) method. We use a fantastic library for this called PyMC3.
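To show the mechanics, here is a hand-rolled random-walk Metropolis sampler, one of the simplest MCMC algorithms, estimating a single trial conversion rate. It is deliberately not PyMC3 code and not the sampler used in production; the Beta(2, 2) prior, step size, and counts are all assumptions chosen for the example.

```python
import math
import random


def log_posterior(rate, conversions, trials, prior_alpha, prior_beta):
    """Unnormalized log posterior: Beta prior plus binomial likelihood."""
    if not 0.0 < rate < 1.0:
        return -math.inf
    log_prior = ((prior_alpha - 1) * math.log(rate)
                 + (prior_beta - 1) * math.log(1 - rate))
    log_likelihood = (conversions * math.log(rate)
                      + (trials - conversions) * math.log(1 - rate))
    return log_prior + log_likelihood


def metropolis(conversions, trials, prior_alpha=2.0, prior_beta=2.0,
               steps=20000, burn_in=2000, step_size=0.05, seed=0):
    """Return posterior samples of the conversion rate."""
    rng = random.Random(seed)
    current = 0.5
    current_lp = log_posterior(current, conversions, trials,
                               prior_alpha, prior_beta)
    samples = []
    for i in range(steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_posterior(proposal, conversions, trials,
                                    prior_alpha, prior_beta)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            samples.append(current)
    return samples


# Hypothetical data: 120 of 400 trials converted.
samples = metropolis(conversions=120, trials=400)
posterior_mean = sum(samples) / len(samples)
print(f"estimated conversion rate: {posterior_mean:.3f}")
```

Calling `metropolis(conversions=0, trials=0)` removes the likelihood term entirely, so the chain samples from the prior alone, which is the "absence of data" case.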
This technique of Bayesian analysis via simulation is called “probabilistic programming.” It’s a complex tool, but understanding it can be a real superpower when it comes to solving general inference problems. If you’re interested in learning more about this topic, I recommend the excellent ebook “Probabilistic Programming and Bayesian Methods for Hackers.”
The final hurdle for producing useful predictions with our model is understanding the average number of renewals (μ). We need to be especially careful here to avoid introducing biases.
If we tried to compute the average number of renewals by simply averaging the number of months until someone unsubscribes, we would vastly underestimate the average, since we’d only be considering users who have already unsubscribed! Users with active subscriptions have lasted at least as long as we have observed them, so they carry real information about retention. We need to incorporate them in our calculations.
Survival analysis is a well-studied area in medicine and engineering that maps very well to our problem. We have a certain number of unsubscribe events we know about (uncensored) and a certain number we do not (censored).
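The size of this bias is easy to see in a simulation. The sketch below invents a cohort with a constant 10% monthly churn rate (so the true average lifetime is 10 months), observes it for only 6 months, and compares the naive average with a censoring-aware estimate. The constant churn rate is a simplifying assumption for this demo only, and the estimator shown (total months observed divided by churn events) is the standard maximum-likelihood estimate for that simple case, not the model described in this post.

```python
import random


def simulate_cohort(n_users, monthly_churn, window, seed=0):
    """Return (months_observed, churned) per user over a limited window."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_users):
        months, churned = 0, False
        while months < window and not churned:
            months += 1
            churned = rng.random() < monthly_churn
        # churned == False means the user is still subscribed (censored).
        records.append((months, churned))
    return records


def naive_average(records):
    """Average lifetime using only users who have already unsubscribed."""
    lifetimes = [months for months, churned in records if churned]
    return sum(lifetimes) / len(lifetimes)


def censoring_aware_average(records):
    """Total months at risk (all users) divided by observed churn events."""
    total_months = sum(months for months, _ in records)
    churn_events = sum(1 for _, churned in records if churned)
    return total_months / churn_events


records = simulate_cohort(n_users=5000, monthly_churn=0.10, window=6)
print(f"naive estimate:           {naive_average(records):.1f} months")
print(f"censoring-aware estimate: {censoring_aware_average(records):.1f} months")
```

The naive figure lands far below the true 10 months because it can never see a lifetime longer than the observation window.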
For our purposes, we found the Weibull distribution to be a good model for fitting a survival curve. The Weibull distribution has some nice characteristics that allow us to account for the possibility of decreasing churn over time. This is helpful because the likelihood that a user will churn in month 1 is usually different than in month 2 and beyond. Which users churn and when is not random: less engaged users churn sooner, and more engaged users are more likely to stick around for the long term.
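For readers unfamiliar with it, the sketch below writes out the standard Weibull survival function, hazard rate, and mean lifetime. The shape and scale values are arbitrary picks for illustration, not fitted parameters, and the continuous-time hazard is only an approximation of month-by-month churn. A shape below 1 gives a hazard that falls over time, which is the "decreasing churn" behavior described above.

```python
import math


def weibull_survival(t, scale, shape):
    """Probability that a subscriber is still active at time t."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, scale, shape):
    """Instantaneous churn rate at time t, given survival up to t."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_mean(scale, shape):
    """Expected lifetime under the Weibull distribution."""
    return scale * math.gamma(1 + 1 / shape)


# Hypothetical parameters: shape < 1 means churn risk declines with tenure.
scale, shape = 6.0, 0.7
for month in (1, 6, 12):
    print(f"month {month:>2}: "
          f"survival = {weibull_survival(month, scale, shape):.2f}, "
          f"hazard = {weibull_hazard(month, scale, shape):.3f}")
print(f"mean lifetime: {weibull_mean(scale, shape):.1f} months")
```

Setting the shape to exactly 1 recovers a constant hazard, which is the memoryless case where every month is equally risky.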
Using the above Bayesian and MCMC techniques, and effectively incorporating censored data and prior estimates from our platform, we are able to produce a more accurate estimate of the average number of renewals.
Putting It All Together: Introducing Experiments by RevenueCat
To help app developers get better A/B test data, we decided to build the tool I’d always wished existed: an end-to-end solution for running price experiments on subscription apps. With our new Experiments tool, you can configure two different Offerings, and our SDK will dynamically change the current offering to display different offerings to each test cohort.
We then collect purchase, renewal, and pricing information and run the data analysis using our model — all within RevenueCat.
Experiments lets you select the earnback period you are most interested in (1 month, 3 months, 1 year, etc.) and gives you a prediction of which Offering has a higher LTV.
Our model also produces confidence intervals for the predicted LTV at different points in a user’s lifetime. This allows you to base your decision-making on the right timescale for your business.
We’re excited to open up the Experiments beta to all our Grow plan users today. We look forward to hearing your feedback and improving Experiments before we release it to a wider audience later this year.
Ready to try it? You can get started with Experiments here.