Tomasz Tunguz (Redpoint VC) and Lloyd Tabb have it wrong, way wrong. Tunguz recently published an article, based on a conversation with Tabb, suggesting that early- and mid-stage software companies can't benefit from A/B Testing because they don't see enough web traffic to make the results statistically significant. Instead, they suggest, such companies should make decisions based on qualitative data:
> … interviewing users about the whys underpinning their points of view on price, reviewing the video of people exploring the product, and opinions about design. It's the qualitative data, the acumen of a brilliant designer, the insight of a skilled product manager, the empathy of a master marketer.
Ouch! Back to anecdotal evidence and marketing decisions made by the most important person in the room. Back to the bad old days, in other words. There's nothing wrong with doing the things they suggest, but before you bet your company on the results, you must A/B test them. They are inputs for deciding what to test, nothing more.
*Back to anecdotal evidence and the bad old ways of marketing…*
Before we throw the A/B Testing baby out with the bathwater, let's take a closer look at what's possible. The Chief Witness for Tunguz and Tabb is Optimizely's Sample Size Calculator:
It's a great tool that I use all the time, BTW. They've selected the default view, which suggests that with a baseline conversion rate of 3% and a minimum detectable effect of 20% at 90% confidence, you will need 12,000 visitors to the page.
There are two key questions to explore before we can agree or disagree with the proposition in an informed manner:
- Are these the right inputs for the Sample Size Calculator?
- Given the right inputs, is the sample size too large for most startups to attain?
For the first question, I submit that the defaults are not very relevant at all. Insisting on 90% confidence, with anecdotal evidence as the only fallback, is pretty silly.
Heck, I run my own bootstrapped startup, it's entirely my own capital at risk (I've accepted no outside investment), and I would be thrilled to ring up tests at 70% confidence all day long.
As it turns out, Optimizely will only let us go down to 80% confidence, but Google's A/B testing tool will report its evaluation of the confidence regardless of the level. I will add that statistical confidence is not the only factor we should consider. It's important to make sure you really have a representative sample. For example, test results may vary by day of the week, so I never accept a test that's run for less than a week, even if the confidence is 90% or more. In fact, I typically prefer 2 weeks as a minimum.
Cutting the Optimizely confidence down to 80% gets us down to a sample size of 11,000. Next, let's consider the baseline conversion rate: 3% is not an especially good benchmark for a product landing page. Groove.com surveyed SaaS companies and found visitor-to-trial conversion rates averaging 8.4%.
If we plug in an 8% conversion rate at 80% confidence, the sample size plummets to 3,300 visitors to measure a 20% detectable effect. We've cut it almost 4x, but we're not quite done. What about that 20%? Is A/B testing only worth conducting when it yields 20% differences?
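To make the arithmetic concrete, here is a sketch of the standard normal-approximation sample-size formula for comparing two proportions. This is not Optimizely's exact method (their calculator uses its own statistics, and I've assumed 80% power, which their tool doesn't expose), so the figures land in the same ballpark rather than matching exactly; the point is how sharply the required sample falls as the baseline rate and confidence change.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, mde_rel, confidence=0.90, power=0.80):
    """Visitors needed per variation to detect a relative lift of mde_rel,
    using the standard normal-approximation formula for two proportions.
    (Optimizely's calculator uses its own statistics, so its figures differ.)"""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z = NormalDist().inv_cdf  # standard-normal quantile function
    n = ((z(confidence) + z(power)) ** 2
         * (p1 * (1 - p1) + p2 * (1 - p2))
         / (p2 - p1) ** 2)
    return math.ceil(n)

# The calculator's default-like scenario: 3% baseline, 20% effect, 90% confidence
defaults = sample_size_per_variation(0.03, 0.20, confidence=0.90)

# The revised scenario: 8% baseline, 20% effect, 80% confidence
revised = sample_size_per_variation(0.08, 0.20, confidence=0.80)

print(defaults, revised)  # the second figure is several times smaller
```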
Here I'll turn to my own A/B Testing experience at my company, CNCCookbook. In the last 8 months I've conducted 55 A/B Tests. The average change between the baseline and the variant I measured was 30%. Are you surprised? I was VERY surprised at how much impact even seemingly little things could have. FWIW, 44% of my tests yielded a positive improvement, 29% showed the idea failed, and 49% of the tests failed to reach statistical significance. I have no idea how that compares to other marketers' results, but I am very happy with mine.
If we plug in that 30% number, we get down to a sample size of 1,300 visitors. Applying my rule of testing for 2 weeks, we need fewer than 100 visits a day to the web page we're testing.
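That duration arithmetic can be sketched as a tiny helper. The visitor counts and the two-week floor are from the discussion above; the function name is mine.

```python
import math

def days_to_run(total_sample, daily_visitors, min_days=14):
    """Days needed to collect the full sample across all variations,
    with a two-week floor so results span full weekly cycles."""
    return max(min_days, math.ceil(total_sample / daily_visitors))

print(days_to_run(1300, 100))    # 14 -- the two-week floor dominates
print(days_to_run(11000, 100))   # 110 -- the 80%-confidence, 3%-baseline scenario
```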
Is that bar too high for startups to clear? It shouldn't be if the marketers are doing their jobs right. I'm a one-man bootstrapped company, and my CNCCookbook site sees about 15,000 views a day. I get about 250 a day to the home page and about 450 a day to my product home page. As I write this, Google Analytics Real-Time cheerfully informs me there are about 50 people running around on my site.
Clearly I can do statistically significant A/B Testing, and it has benefited me quite a lot. I get over 6,000 visits in 2 weeks, so I can measure as little as a 15% change in that time, and even less if I am willing to let the tests run longer. Incidentally, don't overlook the value of a test that ISN'T significant. At the very least, that test tells you that even if the variant is worse, it is no worse than the minimum detectable effect. So if we can test to a 20% detectable effect, adopting the wrong variant will do no more than 20% harm. Sometimes, when we need to move ahead boldly, knowing we can do no more harm than that is good enough.
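You can also run the formula in reverse: given a visitor budget, what's the smallest lift you could detect? The sketch below inverts the same standard normal-approximation formula by bisection, assuming an 8% baseline, 80% confidence, and 80% power (my assumptions, not any tool's exact settings), so it won't reproduce any calculator's figure exactly.

```python
from statistics import NormalDist

def sample_size(baseline, mde_rel, confidence=0.80, power=0.80):
    """Per-variation sample for a two-proportion test (normal approximation)."""
    p1, p2 = baseline, baseline * (1 + mde_rel)
    z = NormalDist().inv_cdf
    return ((z(confidence) + z(power)) ** 2
            * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

def min_detectable_effect(baseline, per_variation, confidence=0.80, power=0.80):
    """Smallest relative lift detectable with the given per-variation sample,
    found by bisection (required sample shrinks as the effect grows)."""
    lo, hi = 0.001, 2.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if sample_size(baseline, mid, confidence, power) > per_variation:
            lo = mid  # effect too small to detect with this sample
        else:
            hi = mid
    return hi

# ~6,000 visits in 2 weeks means ~3,000 per variation in a two-way test
print(round(min_detectable_effect(0.08, 3000), 3))  # in the neighborhood of 15%
```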
Granted, I've had this company for a few years, but if I can get this far by myself, a VC-funded startup should be able to do at least as well, and much faster. They have to, in order to have much hope of a unicorn valuation. Tabb's company, Looker, the one that presumably prompted the discussion, appears to have a little less than half the organic search traffic I get, based on SEMRush results. Clearly, Looker could benefit tremendously from A/B Testing if it chose to.
So, VC board members: expect quantifiable results from your portfolio companies, and don't take sample-size whining for an answer. Entrepreneurs, saddle up and ride this A/B Testing horse; it's a powerful tool that can really move the needle.
My best advice for startups right at the beginning, BTW, is to start building your audience BEFORE you build your product. I call it achieving Content-Audience Fit, I've been writing about it for years, and it is absolutely the very first thing a founding team should do when they get together. Achieving it provides a number of powerful validations for your team. More importantly, it confirms there is a reachable audience, and in reaching it, you gain a powerful tool for shaping your journey to Product-Market Fit. Not to mention, you set yourself up to reach meaningful A/B Testing traffic that much sooner.
Stealth Mode is harmful in this respect: it delays your access to Content-Audience Fit for no meaningful benefit. So what if the world knows what broad market you're working in, or even what broad problems you write about? You don't have to tell them anything about your product or how it helps solve those problems.
No more excuses: get together with your marketing people and do some rigorous A/B Testing!