Monday, October 29, 2012

A/B testing scale cheat sheet

This is not a guide to how to do A/B testing.  If you want that, see Effective A/B Testing, or any number of companies that will help you with A/B testing.  Instead this is a cheat sheet of basic facts on A/B testing (mostly on the scale involved) to help people who are beginning figure out what is feasible.
  • If you've never tested, expect to find a number of 5-20% wins.

    In my experience, most companies find several changes that each add in the neighborhood of 5-20% to the bottom line in their first year of testing.  If you have enough volume to reliably detect wins of this size in a reasonable time frame, you should start now.  If not, then you're not a great candidate for it... yet.
  • Experienced testers find smaller wins.

    When you first start testing, you pick up the low-lying fruit.  The rapid increase in profits can spoil you.  Over time the wins get smaller.  And more specific to your business.  Which means they will take longer to detect.  Expect this.  But if you're still finding an average of one 2% win every month, that is around a 25% improvement in conversion rates per year.
  • What works for others likely works for you.  But not always.

    Companies that test a lot have settled on simple value propositions, streamlined their signup processes, put down big calls to action, placed the same call to action in multiple places, and do email marketing.  Those are probably going to be good things for you to do as well.  However Amazon famously relies on customer ratings and reviews.  If you do not have their scale or product mix, you'll likely get very few reviews per product, and may get overwhelmingly negative reviews.  So borrow, but test.
  • Your testing methodology needs to be appropriate to your scale and experience.

    A company with 5k prospects per day might like to run a complex multivariate test all of the way to signed up prospect, and be able to find subtle 1% conversion wins.  But they don't generate nearly enough data to do this.  They may need to be satisfied with simple tests on the top step of the conversion funnel.  But it would be a serious mistake for a company like Amazon to settle for such a poor testing methodology.  In general you should use the most sophisticated testing methodology that you generate enough data to carry off.
  • Back of the envelope: A 10% win typically needs around 800-1500 successes per version to be seen.

    One of the top questions people have is how long a test takes to run.  Unfortunately it depends on all sorts of things, including your traffic volume, conversion rates, the size of the win to be found, luck, and what confidence you cut the test off at.  But if one version gets to 800 successes when the other is at 880, you can conclude at a 95% confidence level that the leader is better.  If you wait until you have 1500 versus 1650, you can conclude at a 99% confidence level.  This data point, combined with your knowledge of your business, gives you a starting point for planning.
  • Back of the envelope: Sensitivity scales as the square root of the number of observations.

    For example a 5% win takes about 2x as much sensitivity as a 10% win, which means 4x as much data.  So you need 3200-6000 successes per version to see it.
  • Data required is roughly linear with number of versions.

    Running more versions requires a bit more data per version to reach confidence.  But not a lot.  Thus the amount of data you need is roughly proportional to the number of versions.  (But if some versions are real dogs, it is OK to randomly move people from those versions to other versions, which speeds up tests with a lot of versions.)  Before considering a complicated multivariate test, you should do a back of the envelope to see if it is feasible for your business.
  • Even if you knew the theoretical win, you can't predict how long it will actually take to within a factor of 3.

    An A/B test reaches confidence when the observed difference is bigger than chance alone can plausibly explain.  However your observed difference is the underlying signal plus a chance component.  If the chance component is in the same direction as the underlying signal, the test finishes very fast.  If the chance component is the opposite direction, then you need enough data that the underlying signal overrides the chance signal, and goes on to still be larger than chance could explain.  The difference in time is usually within a factor of 3 either way, but it is entirely luck which direction you get.  (The rough estimates above are not too far from where you've got a 50% chance of having an answer.)
  • The lead when you hit confidence is not your real win.

    This is the flip side of the above point.  It matters because someone usually has the thankless task of forecasting growth.  If you saw what looked like an 8% win, the real win could easily be 4%.  Or 12%.  Nailing that number down with any semblance of precision will take a lot more data, which means time and lost sales.  There generally isn't much business value in knowing how much better your better version is, but whoever draws up forecasts will find the lack of a precise answer inconvenient.
  • Test early, test often.

    Suppose that you have 3 changes to test.  If you run 3 tests, you can learn 3 facts.  If you run one test with all three changes, you don't know which change actually made a difference.  Small and granular tests therefore do more to sharpen your intuition about what works.
  • Testing one step in the conversion funnel is OK only if you're small and just beginning testing.

    Every business has some sort of conversion funnel which potential customers go through.  They see your ad, click on it, click on a registration link, actually sign up, etc.  As a business, you care about people who actually made you money.  Each step loses people.  Generally, whatever version pushes more people through the top step gets more business in the end.  Particularly if it is a big win.  But not always!  Therefore if testing eventual conversions takes you too long, and you're still finding 10%+ wins at the top step in your funnel, it makes business sense to test and run with those preliminary answers.  You'll make some mistakes, but you'll get more right than wrong.  Testing poorly is better than not testing at all.
  • People respond to change.

    If you change your email subject lines, people may be 2-5% more likely to click just because it is different, whether or not it is better.  Conversely moving a button on the website may decrease clicks because people don't see the button where they expect it.  If you've progressed to looking for small wins, then you should keep an eye out for tests where this is likely to matter, and try to dig a little deeper on this.
  • A/B testing revenue takes more data.  A lot more.

    How much more depends on your business.  But be ready to see data requirements rise by a factor of 10 or more.  Why?  In the majority of companies, a fairly small fraction of customers spend a lot more than average.  The detailed behavior of this subgroup matters a lot to revenue, so you need enough data to average out random fluctuations in this slice of the data.
  • Interaction effects are likely ignorable.

    Often people have several things that they would like to test at the same time.  If you have sufficient data, of course, you would like to look at each slice caused by a combination of possible versions separately, and look for interaction effects that might arise with specific combinations.  In my experience, most companies don't have enough volume to do that.  However if you assign people to test versions randomly, and apply common sense to avoid obvious interaction effects (e.g. red text on a red background would cause an interaction effect), then you're probably OK.  Imperfect testing is better than not testing, and the imperfection of proceeding is generally pretty small.
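The back-of-the-envelope figures in the list above can be sanity checked with a few lines of Python.  This is only a rough sketch under assumptions that are mine rather than anything stated above: both versions get equal traffic, conversion rates are low enough that each success count behaves roughly like a Poisson count, and "confidence" means a two-sided normal approximation.  The function names are made up for illustration.

```python
from math import erf, sqrt

def z_score(successes_a, successes_b):
    """Approximate z-score for the gap between two success counts.

    Assumes equal traffic per version and treats each count as Poisson,
    so the variance of the difference is the sum of the two counts.
    """
    return abs(successes_b - successes_a) / sqrt(successes_a + successes_b)

def two_sided_confidence(z):
    """Two-sided confidence level for a z-score under a normal approximation."""
    return erf(z / sqrt(2))

def scaled_successes(base_successes, base_win, target_win):
    """Scale a data requirement using the square root rule:
    half the win needs twice the sensitivity, which is four times the data."""
    return base_successes * (base_win / target_win) ** 2

for a, b in [(800, 880), (1500, 1650)]:
    z = z_score(a, b)
    print(f"{a} vs {b}: z = {z:.2f}, confidence = {two_sided_confidence(z):.1%}")

for base in (800, 1500):
    needed = scaled_successes(base, 0.10, 0.05)
    print(f"a 5% win needs about {needed:.0f} successes per version")
```

The two comparisons land close to the 95% and 99% levels, and the scaling reproduces the 3200-6000 range for a 5% win.  Treat the output as a planning aid, not as a replacement for a proper test of significance.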
As always, feedback is welcome.  I have helped a number of companies in the Los Angeles area on A/B testing, and this tries to answer the most common questions that I've encountered about how much work it is, and what returns they can hope for.

Wednesday, October 17, 2012

My son's flashcard routine

My 7 year old son is in grade 2. In the previous grade, despite his intelligence, he was significantly behind his class in handwriting, letter reversals, and spelling. He was getting extra help from his teacher, but he still had an uphill battle. So I decided to start a flashcard routine to assist. This solved the original problem.  Here is a description of the current routine, and how it has evolved to this point.

It will surprise nobody who has read Teaching Linear Algebra that I started with the thought of some sort of spaced repetition system to maximize his long-term retention with a minimum of effort.  I needed to help him with handwriting, so I wanted to be personally evaluating how he was doing.  This seemed simplest with a manual system.  I therefore settled on a variation of the Leitner system because that is easy to keep track of by hand.

To make things simple for me to track, I am doing things by powers of 2.  Every day we do the whole first pile.  Half of the second.  A quarter of the third.  And so on.  (Currently we top out at a 1/256th pile, but are not yet doing any cards from it.)  Cards that are done correctly move into the next pile. Those that he gets wrong fall into the bad pile, which is the next day's every day pile.
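For anyone who thinks better in code, here is a minimal Python sketch of one day of that powers-of-2 routine.  It is an illustration, not what we actually use, since our system is entirely manual.  The function name `run_day`, the choice to review the cards that have waited longest first, and the rule that cards in the last pile stay there when answered correctly are all assumptions made for the sketch.

```python
import math
from collections import deque

def run_day(piles, answered_correctly):
    """Review one day's cards and return how many were reviewed.

    piles[0] is the every day pile; pile k has 1/2**k of its cards
    reviewed each day.  answered_correctly(card) reports the result.
    """
    # Choose today's cards up front so a card promoted today is not asked twice.
    todays = []
    for k, pile in enumerate(piles):
        count = math.ceil(len(pile) / 2 ** k)
        # Assumption: take the cards that have waited longest (front of the deque).
        todays.append([pile.popleft() for _ in range(count)])

    for k, cards in enumerate(todays):
        for card in cards:
            if answered_correctly(card):
                # Assumption: cards in the last pile stay in the last pile.
                target = min(k + 1, len(piles) - 1)
            else:
                # Wrong answers become part of the next day's every day pile.
                target = 0
            piles[target].append(card)
    return sum(len(cards) for cards in todays)

# Hypothetical example: he knows every card except "dog".
piles = [deque(["cat", "dog"]), deque(["a", "b", "c", "d"]), deque()]
reviewed = run_day(piles, lambda card: card != "dog")
print(reviewed, [list(pile) for pile in piles])
```

Selecting the day's cards before moving anything matters: otherwise a card promoted out of the every day pile could be asked again a few minutes later in the next pile.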

So far, so good.  I tried this.  Then quickly found that I did an excellent job of sifting through all of the words he knew and getting the ones he didn't know into the bottom pile.  But he wasn't learning those.  This led to frustration.  Not good.

I then added an extra drill on the pile that he got wrong.  At the end of the session, we do a quick drill with just the problem cards.  Here is the drill until we get to 3 cards.  If he gets the card first try, or gets a card that came all of the way from the bottom since he last got it wrong, it is removed from the drill.  If he gets it wrong, I tell him how to do it, and put it back in the pile near the top so he sees it again soon.  If he gets it right after a recent reminder, it goes to the bottom to get a chance to come out of the drill.

After we get down to 3 cards, I switch the drill up.  If he gets a card wrong I correct him and put it in slot 2.  If he gets it right I put it on the bottom.  Once he gets all three right, I end the drill for that day.

After I added this final drill on the problem cards, the "not learning" problem disappeared.  He began learning, and saw his school performance improve.  His spelling tests went from under half the words correct to the 80-100% range.  Everyone was happy.

It is worth noting that at the end of grade 1 he took several tests, and we found that he was spelling at a grade 3 level.  We have no direct measurement proving it, but I am confident that he spells even better now.

This happiness lasted until he got used to doing well.  Over time we had more piles.  In school he was being given more words.  I began adding simple arithmetic facts.  This meant more and more work.  Not fun work.  Sometimes he would make a mistake on a card that he had known for a long time.  Then he'd get upset.  Once he got upset he'd get lots of others wrong.  Over the next few days we'd get the cards moving back up the piles, then it would happen again.  The flashcard routine became a point of conflict.

Then I had a great idea (which I borrowed from a speech therapist).  The idea is that I'd mix a reward activity and flashcards.  We'd start on the reward, then do a pile, go back to the reward, then do another pile, go back to the reward, and so on.  The specific reward activity that we're using is that I'm reading books to him that are beyond his current reading level (currently The Black Cauldron), but in principle it could be anything.  With this shift, the motivation problem completely disappeared.  He enjoys the reward.  The flashcards are a minor annoyance that gets him the reward.  If he goes off track, the reward restores his equilibrium.  Intellectually he's happy that he's mastering the material.  But the reward is motivation.

With this fix in place, we lasted several months.  Then we developed an issue.  A couple of words were sufficiently hard that they just stayed in the bad pile every day.  So I made a minor tweak.  I had been doing his top pile, then his next, then his next, on down.  But instead I do his every day pile.  Then go into the top pile, next, next, etc.  But after each of those groups I try him again on the every day words that he hasn't gotten right yet.  Thus he is forced to get his trouble words right 2x per day.  This helped him master them and got them moving back up.

With that fix, we lasted until this week.  This week we had a problem.  His spelling test for this week includes the word embarrassing.  (And he can get a bonus for knowing peculiar.)  The problem is that this word has enough spelling tricks that he simply cannot get it right in one pass.  We tried several times, without success.  I therefore have added flashcards like em(barr)assing for which he gets told, "The word 'embarrassing' starts 'em'.  Write the 'barr' bit."  With these intermediate flashcards he seems to be breaking up learning the whole word into manageable tasks, from which he can learn the word itself.  But I've also generated a ton of temporary flashcards, which may become an issue.  (I plan on removing those piecemeal ones after he successfully gets them in the every 8 day pile.  In a few weeks I'll know how well this is working.)

That brings us to the current state of his flashcard routine.  He currently has hundreds of spelling words and basic arithmetic facts learned.  373 of them learned sufficiently well that he reviews them less than once per month.  But I am sure that I'm not done tweaking.  Here are current issues:
  1. One week is not enough.  Every week he is given a new set of words to master.  But as anyone who has done spaced repetition knows, a week is not very long to master material.  Spaced repetition excels for memorizing a body of data over years, not one week.  On most weeks he is given a set of standard words to learn, and a set of words for bonus points.  With the bonus words he usually gets over 100% on his tests.  But we don't stop, so now he'd do substantially better on last week's test than he actually did last week.
  2. He's only learning what I know that he needs to.  This week I reached out to his teacher and said that I am doing flashcards with him, and looked for feedback on more ways to use them for his benefit.  She pointed out a number of things he can improve on, including common words that he has wrong, grammar, poems he is supposed to memorize, and geography that he is supposed to learn.  The flashcard routine can help with these issues in time, but I had not been aware that he needed it.  Better late than never...
  3. Work is climbing again.  Currently every day I add 2 cards.  Plus every week I add a spelling test of unpredictable size (this week 27, of which he already knew one).  This is increasing the size of the bottom piles, and the work has been increasing.  It is manageable, but I'm keeping my eye on it.
  4. This takes my time.  At the moment that's unavoidable.  One of the issues that we're still working on is handwriting, so there needs to be a human evaluation of what he's doing.  But still I'm taking an hour per day with this.  I think it is an hour well-spent that we both value.  However in a couple of years if his sister needs similar help, what then?  In the long run I'd love to offload the flashcards to a computer program, but the idea of a reward activity has to be in there.  All of the flashcard apps that I've seen assume that doing flashcards is itself a fun activity.  That will not work for my son.  Maybe I'm being too picky.  But I've developed opinions about what works while fine-tuning my son's system.  If there is something that fits that, I'd love to find it.
If you've wound up building a similar or different system to help your children learn, please tell me in the comments.  I've borrowed ideas from all over, and would be happy to try anything reasonable that gets suggested.

Monday, October 8, 2012

How reliable will the Falcon 9 be?

Let's apply statistics to see, based on current launch data, how reliable we predict that the Falcon 9 will be.

Falcon 9 just had a launch that succeeded despite an engine failure.  According to design parameters, it should be able to survive the failure of any two engines.  But the flight can be lost if we lose 3+ engines.  Exactly how reliable is the Falcon 9 design?

Let me first take a naive approach.  To date we've had 4 launches of the Falcon 9, each with 9 engines (that's the 9 in Falcon 9), and have seen one in-flight engine failure.  The measured success rate of an engine is therefore 35/36.  With that in mind, we can produce the following figures.
  • Probability of no engine failures: (35/36)**9 * (1 - 35/36)**0 * (9 choose 0) = (35/36)**9 = 77.6%
  • Probability of 1 engine failure: (35/36)**8 * (1 - 35/36)**1 * (9 choose 1) = (35/36)**8 * (1/36) * 9 = 20.0%
  • Probability of 2 engine failures: (35/36)**7 * (1 - 35/36)**2 * (9 choose 2) = (35/36)**7 * (1/36)**2 * 36 = 2.3%
  • Probability of 3+ engine failures: 1 - above probabilities = 0.2% (actually 0.16%)
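
The figures above are ordinary binomial probabilities, so they are easy to recompute.  Here is a small Python sketch, assuming (as the naive approach does) that engines fail independently and with the same probability.  The function names are made up for this example.

```python
from math import comb

def failure_count_probability(k, p_fail, engines=9):
    """Binomial probability that exactly k of the engines fail."""
    return comb(engines, k) * p_fail**k * (1 - p_fail) ** (engines - k)

def loss_probability(p_fail, engines=9, tolerated=2):
    """Probability that more engines fail than the design can tolerate."""
    survivable = sum(
        failure_count_probability(k, p_fail, engines)
        for k in range(tolerated + 1)
    )
    return 1 - survivable

p = 1 / 36  # measured engine failure rate: 1 failure in 36 engine flights
for k in range(3):
    print(f"{k} engine failures: {failure_count_probability(k, p):.1%}")
print(f"3+ engine failures: {loss_probability(p):.2%}")
```

Changing `p` lets you see how sensitive the bottom line is to the per-engine failure rate, which turns out to matter a great deal.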

For comparison the US Space Shuttle had a failure rate of 2/135 which is about 1.5%.

So SpaceX flights are dangerous compared to most things that we do, but so far seem much better than any previous mode of transport into space, including the US Space Shuttle.  Which was previously the most reliable form of transport into space.  (Not the safest though!  Soyuz has that record because, unlike the Space Shuttle, they've demonstrated the ability to have passengers survive a catastrophic failure that aborted the mission.)

But is that the end of the story?  No!

Suppose that the true failure rate of each individual engine is actually 10%.  Then an exactly parallel calculation to the above will find that the failure rate of a rocket launch is 5.3%.  That doesn't sound very reliable!

However is it reasonable to think that 10% is a likely failure rate for an engine?  Well suppose that before we had seen any launches we thought that a 10% failure rate was equally likely as a failure rate of 1/36.  Our observation is 1 engine failure out of 36.  The odds of that exact observation with a 10% failure rate are 9.0%.  The odds of that observation with a failure rate of 1/36 are 37.3%. According to Bayes' theorem, the probabilities that we give to theories after making an observation should be proportional to our initial belief of the probability of that theory times the probability of the given observation under that theory.

That is a mouthful.  Let's look at numbers.  In this hypothetical scenario our initial belief was a 50% chance of a 10% failure rate, and a 50% chance of a failure rate of 1/36.  After observing 36 instances of engines lifting off with 1 failure, the 10% theory has probability proportional to 4.5%, while the 1/36 theory has probability proportional to 18.65%.  Thus our updated belief is that the 10% theory has likelihood 4.5/(4.5 + 18.65) = 0.194 = 19%.  (Without the intermediate rounding we'd actually be at 0.195.)  And the 1/36 theory has likelihood around 80%.  Then combining the predictions of the theories with the likelihood assigned to each theory we get an estimated failure rate of 0.053 * 0.195 + 0.0016 * 0.805 = 0.0116 = 1.16%.  Our confidence in the record put up by the Falcon 9 is not as good now!
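
The two-theory update can be written out in Python as a check on the arithmetic.  This is a sketch of the hypothetical scenario just described (two candidate engine failure rates with a 50/50 prior), again assuming independent engine failures.  The variable and function names are mine.

```python
from math import comb

def likelihood(p_fail, failures=1, trials=36):
    """Probability of seeing exactly `failures` engine failures in `trials` engine flights."""
    return comb(trials, failures) * p_fail**failures * (1 - p_fail) ** (trials - failures)

def loss_probability(p_fail, engines=9, tolerated=2):
    """Probability that more than `tolerated` of the engines fail on one launch."""
    return 1 - sum(
        comb(engines, k) * p_fail**k * (1 - p_fail) ** (engines - k)
        for k in range(tolerated + 1)
    )

# Prior belief: the two theories are equally likely.
priors = {0.10: 0.5, 1 / 36: 0.5}

# Bayes' theorem: posterior is proportional to prior times likelihood.
weights = {p: prior * likelihood(p) for p, prior in priors.items()}
total = sum(weights.values())
posterior = {p: w / total for p, w in weights.items()}

# Blend each theory's predicted launch failure rate by its posterior weight.
predicted_loss = sum(posterior[p] * loss_probability(p) for p in posterior)

for p in posterior:
    print(f"engine failure rate {p:.4f}: posterior {posterior[p]:.3f}")
print(f"estimated launch failure rate: {predicted_loss:.2%}")
```

Adding more candidate theories is just a matter of adding more entries to `priors`, as long as the prior weights sum to 1.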

Please note the following characteristics of this analysis:
  1. Observations do not tell us what reality is, they update our models of reality.
  2. A wide range of failure probabilities fit the limited observations that we have so far on the Falcon 9.
  3. With enough data, theories that are far away from the observed average become very unlikely.
Now a curious person might want to know what the odds of failure would be if we included more possible prior theories.  I whipped up a quick Perl script to do the calculation for an initial expectation that 0.00%, 0.01%, 0.02%, ..., 99.99%, 100% were all equally likely failure rates a priori.  When I run that script I get a probability of 0.0198180199757443, which is an estimated failure rate of about 2%.  If you start with different beliefs, you can generate very different specific numbers.  For an extreme instance if you believe that SpaceX is constantly improving, so their future engines are likely to be more reliable than their past ones, then ridiculously good numbers become very plausible.
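
For readers who would rather not write Perl, here is a Python sketch of the same kind of calculation.  It is not the script I ran, just an equivalent under the stated setup: every failure rate from 0.00% to 100% in steps of 0.01% gets equal prior weight, the observation is 1 failure in 36 engine flights, and engines fail independently.

```python
from math import comb

def loss_probability(p_fail, engines=9, tolerated=2):
    """Probability that more than `tolerated` of the engines fail on one launch."""
    return 1 - sum(
        comb(engines, k) * p_fail**k * (1 - p_fail) ** (engines - k)
        for k in range(tolerated + 1)
    )

# Candidate engine failure rates: 0.00%, 0.01%, ..., 100%, all equally likely a priori.
rates = [i / 10000 for i in range(10001)]

# Likelihood of 1 failure in 36 engine flights under each rate.  The constant
# factor (36 choose 1) and the equal prior weights cancel when we normalize.
weights = [p * (1 - p) ** 35 for p in rates]
total = sum(weights)

# Posterior-weighted average of the launch failure probability.
estimate = sum(w * loss_probability(p) for p, w in zip(rates, weights)) / total
print(f"estimated launch failure rate: {estimate:.4f}")
```

Swapping in a different list of `rates`, or multiplying each weight by a non-uniform prior, shows how strongly the answer depends on what you believed before the launches.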

However the bottom line is that we cannot yet, based on the data that we have so far, conclude that we have good evidence that the Falcon 9 actually will put up a better reliability record over its lifetime than previous space vehicles.