Thursday, March 23, 2023

Temporary note

I am posting this to demonstrate that this account is associated with btilly@gmail.com. I'll delete it soon, so don't get attached to it.

Thursday, January 7, 2016

Why NSA surveillance scares me

Nothing To Hide explains why you should care about surveillance even though you have nothing to hide. But it missed the point that most concerns me.

Surveillance states target people who are disliked by the established political order. Starting with politicians who might inconvenience those in charge of surveillance. This distorts the political system in a way that should scare everyone.

Here are two examples from fairly recent US history:
  1. J. Edgar Hoover maintained control of the FBI and its predecessor from 1924 until his death in 1972 because politicians knew he had the goods on them. So nobody wanted to challenge his authority, no matter what they thought of him.

    For example, he ignored the Mafia until The Apalachin Meeting was discovered by local police and reported nationally by newspapers in 1957. Even then, organized crime was prioritized below his private anti-Civil Rights Movement vendetta, even though this put his policies at odds with several Presidents.

    What new Hoover will emerge in the future, and what political agenda will he impose on all of us?
     
  2. Richard Nixon is one of the most influential Presidents in US history. His accomplishments range from ending the Vietnam war to opening relations with China to creating the EPA. But he is remembered for Watergate. Why?

    The Watergate scandal was about Nixon conducting surveillance against political opponents, then covering up evidence after it was discovered. The Washington Post traced a minor burglary back to Nixon, the House Judiciary Committee approved articles of impeachment, and he resigned on August 9, 1974 once it was clear that the Senate would convict him. He had been re-elected in one of the largest landslides in US history not long before. He was widely hated for a generation after.

    Public opinion and our politicians responded to his case with fear that his actions put us on the path to dictatorship. Memories of Germany and WW II were still fresh, and nobody wanted that to happen here.

    Nixon wouldn't have been caught today. Current NSA infrastructure gives many thousands of analysts access to everything that Nixon wanted, with no clues from which they can be detected outside the system. And there is no hope that they would be caught from inside the system, either. Nixon got the active cooperation of the FBI. But let's assume a rogue President didn't manage that, and only subverted one rogue analyst. According to NSA documents, there is a "we trust you'll stop" system to keep analysts from spying in the US. Analysts don't even file a report if they have done it. According to other revelations, pre-Snowden the NSA had no other effective oversight mechanism. The most likely result of an effective oversight mechanism is the possibility of embarrassment for the NSA, so they probably haven't created one since.

    Will our political system get subverted? Has it been already? How would we know?
     
How do I rank this threat versus the terrorist threats that the system is supposedly protecting me from?

In my lifetime, the total number of Americans killed or injured on US soil by terrorism and related causes is under 5000. (Here is a list of incidents.) There are over 300 million Americans alive and my life is more than half-over, so my future risk of injury or death from a terrorist attack should be under 1/60,000.

Let's analyze the other side conservatively. No government in history has ever survived 3000 years. I doubt that the US government will be an exception. Surveillance sliding into dictatorship is one of the most common ways that democracies fall. So it is reasonable to give us greater than a 1% chance of falling into dictatorship in any random period of 30 years. Such as the rest of my life. I have a habit of speaking my mind in public, just as I'm doing here. Suppose this gives me at least a 10% chance of showing up on some surveillance list. Let's further suppose that being on that list gives me at least a 10% chance of having bad things happen to me once the government falls. I now have greater than a 1/10,000 chance of death or injury in my lifetime from our surveillance state.

Therefore our surveillance state is many times more likely to result in my death or injury than the terrorists which it supposedly protects me from. Both possibilities are unlikely, but surveillance is the much bigger risk.
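
The comparison is easy to check by hand, but a short Python sketch makes every guess explicit. All inputs are the rough figures from the text above; the variable names are only illustrative.

```python
# Rough risk comparison using the guesses stated in the text above.

# Terrorism: under 5000 Americans killed or injured, out of 300 million.
terror_casualties = 5000
population = 300_000_000
terror_risk = terror_casualties / population  # upper bound on the risk

# Surveillance: a chain of three guessed probabilities.
p_dictatorship = 30 / 3000   # a 30-year window out of at most 3000 years
p_on_list = 0.10             # chance of showing up on a surveillance list
p_harm_if_listed = 0.10      # chance of harm, given the list and a fallen government
surveillance_risk = p_dictatorship * p_on_list * p_harm_if_listed

print(f"terrorism:    1 in {1 / terror_risk:,.0f}")
print(f"surveillance: 1 in {1 / surveillance_risk:,.0f}")
print(f"ratio:        {surveillance_risk / terror_risk:.0f}x")
```

Changing any one of the three guesses scales the final ratio by the same factor, which makes it easy to see how sensitive the conclusion is to each assumption.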

Stop and think about that. I don't do much wrong. I'm not out to overthrow or hurt anyone. I may not like our politicians, but I bear them little ill will. I believe that the vast majority of people who are part of our security structure have the best of intentions. I believe that most of our NSA and military are truly devoted to protecting people like me. But I still conclude that I am many times more likely to suffer harm from NSA surveillance than I am from Muslim terrorists.

This fact scares me. Maybe you should be scared too.

Tuesday, August 13, 2013

Becoming a web designer

I recently was having a conversation with an art student who wanted a better future than her job as a barista was likely to provide.  I suggested a job as a web designer.  She had no idea what would be involved, so I promised to talk to a web designer that I knew then get back to her with advice about what it would take.

He laid out a very concrete guide to follow for someone starting with an art background, talent for composition, proficiency with photoshop, and a willingness to learn.  I'm putting it up as a blog in case other people have suggestions, corrections, or can benefit from the advice.  Here is the guide.
  1. Design your own website in Photoshop.  What the website is does not matter, but making it look good does.  Think up something obvious (for instance a gallery for your artwork), figure out what pages it will have, what they should look like, then make prototypes in Photoshop.  If you've got artistic skills and know Photoshop well, this hopefully is fairly easy.
  2. Build that website in WordPress or another CMS (Content Management System).  Now that you know what you want the web page to look like, try to make a web page that looks like it.  You'll need to buy hosting, buy a domain, then install your CMS.  Now try to make WordPress serve up pages that look like the ones that you have designed.  You will encounter a lot of problems.  Google searches will often help you find solutions.  Browse http://wp.smashingmagazine.com/ for more ideas.  And if you need to compromise on your initial design because you can't figure out how to make it work, do so.  The goal here is to get something done that looks nice, and not necessarily to perfectly realize your first vision.
  3. Start freelancing.  Once you have demonstrated that you can build an attractive website, you have a skill that people are willing to pay for.  Go to https://www.elance.com/ and start offering your skills as a web designer.  You should start at the low end of the free-lance market, say around $50/hour.  You won't get steady work, but it is a nice side income.  And, more importantly, you're building your experience and a portfolio.
  4. Look for full time work as a web designer.  In a few months you should have 6-8 actual websites under your belt, and will have learned a lot more about working with customers.  This is a real portfolio to point to during a search for full-time work as a web designer.  Depending on luck, opportunities, location, etc, you might land the job you want, or may need to take an intern to full-time route.  A reasonable salary in the Los Angeles area to aim for with the full time job is probably around $70K to do design and HTML.  That varies widely over time, by geographical market, by company, etc.
  5. Upgrade your skills.  Once you have a job, you've turned skill at composition, knowledge of photoshop, and a willingness to learn into a real career.  But you have just begun.  There is a lot more to learn.  In particular over the next few years you need to improve on the underlying technologies of HTML, CSS, and JS.  You should be trying to educate yourself about UX.  You need to build a professional network.  There is a long ladder to climb, but at least you've got your foot on the bottom rung.
This outline was recommended by one person that I know who became a web designer.  It sounded reasonable to me, but I am not a web designer.  Starving art students may be surprised at how quickly you go from designing something for yourself to getting paid significant money for your time.  That happens because you're solving real problems.  Your combination of artistic talents and willingness to tackle new things is worth far more to employers than either is alone.

Please share if you find this helpful, have alternate suggestions, experience, etc.

Wednesday, November 21, 2012

Speculating about the Hyperloop

Elon Musk has been dropping hints about his Hyperloop idea.  (We cannot call it a proposal because he has not actually proposed it yet.)  There is a lot of curiosity about what it might be.  Given Elon's history, the idea will sound audacious and yet will actually be workable.

Jacques Mattheij recently speculated on the topic.  His proposal has the serious problem that the friction of the air on the sides of the tunnel would lose way too much energy.  But it got me thinking, and I have what may be a more realistic proposal.

First, imagine a tube that goes in a loop from Los Angeles to San Francisco.  Let's put flaps on the walls.  When they are open, air pressure can equalize.  When they are closed, they don't leak much.  Now let's put large, heavy objects going around and around the loop.  For lack of a better name, let's call them plungers.  The plungers can be floated and moved very efficiently with maglev technology.  As each plunger approaches, the flaps open so that air can get pushed out, then close so that it doesn't come back in.  This is not an evacuated tube (Elon explicitly says that his technology isn't an evacuated tube), but results in a decent vacuum away from the shockwaves in front of each plunger.  That eliminates most of your friction losses.  I don't know how low, but Elon claims low enough that solar panels on top of the device provide more than enough energy to keep it permanently going.  I see no reason to disbelieve that solar panels could do that.

Now where do the people fit in?  People go into vehicles that I'll call cars, even though they aren't really cars.  These cars can be fired by a railgun to match speeds with the tube, and injected in front of a plunger.  We can build the plunger with a space in its front that the car fits in.  This space has air trapped in it by the shock wave and so the people can breathe.  On its own that space would heat up due to the friction on the gas, but you can put a heat sink (eg a block of ice) in the car and keep it comfortable inside.  Near the end of your journey the plunger ejects the car from this space on a course that launches it out of the tube while the plunger continues on its way.  The car is then stopped with regenerative braking that recovers most of the launch energy, resulting in surprisingly little energy loss for taking the trip.

Now what would some of the specs be?  Well, Elon claims 30 minutes from downtown Los Angeles to downtown San Francisco.  According to google maps that's 382 miles, which is about 600 km.  So the loop should be going around 1200 km/hour.  If we put 12 plungers on the loop, and have plenty of vehicles, then you get in your vehicle and have a launch opportunity every 5 minutes.  Increase the number of plungers, and the time to launch can be decreased while the capacity of the system can be increased.  If we space the plungers a third of a km apart, we would have 3600 of them and could be launching every second into the system.  It probably is more efficient if you instead make plungers larger so that cars carry more people.  So instead of car think "bus".  But after the initial system is built, you can later add new entrance/exit ramps and ramp up capacity.  As Elon has promised, you would not need to reserve tickets - you'd pretty much arrive and then go.
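
As a quick check of those figures, here is a small Python sketch. It assumes the loop is simply twice the one-way distance and that plungers are evenly spaced; the function name is made up for illustration.

```python
# Loop geometry and launch intervals, using the figures from the text.
one_way_km = 600          # Los Angeles to San Francisco, approximately
trip_minutes = 30         # the claimed travel time

loop_km = 2 * one_way_km                      # out and back (an assumption)
speed_kmh = one_way_km / (trip_minutes / 60)  # speed needed for a 30 minute trip
lap_minutes = loop_km / speed_kmh * 60        # time for one plunger to go around


def launch_interval_seconds(plungers):
    """Seconds between plungers passing a station, if evenly spaced."""
    return lap_minutes * 60 / plungers


print(f"speed: {speed_kmh:.0f} km/hour")
print(f"12 plungers:   one every {launch_interval_seconds(12) / 60:.0f} minutes")
print(f"3600 plungers: one every {launch_interval_seconds(3600):.0f} second(s)")
```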

Elon also claims that his system could store a lot of energy, enough to collect energy during the day and run off of it at night.  The obvious place to store energy is in the kinetic energy of the plungers.  How much energy are we talking?  Well 1200 km/hour is a third of a km per second.  That's 55.6 kJ of energy per kg of plunger.  So 64,800 kg (about 143,000 pounds) of plunger is a megawatt-hour of energy.  Suppose that is one plunger.  If you've got a thousand plungers, and each is storing a full megawatt-hour, you could continuously draw 50 megawatts of power.  You'd use up over half of the stored energy at night, then regain it during the day.  Trips taken in the early morning commute might take 50% longer than during the evening, but it is doable.  This gets better if we make plungers bigger, have a more efficient system, or have more plungers.  I'm sure that Elon has thought about the ideal parameters.  But if heavy plungers are good, well, put in enough metal for maglev to work and then add rock.  You'd store a lot of energy.
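
The storage numbers follow from E = 1/2 m v^2. A minimal Python sketch of the arithmetic, assuming a 12-hour night (the text only says "at night") and ignoring all losses:

```python
# Kinetic energy stored in a plunger, E = 1/2 * m * v^2.
speed_m_per_s = 1200 * 1000 / 3600          # 1200 km/hour in m/s
energy_per_kg_j = 0.5 * speed_m_per_s ** 2  # joules stored per kg of plunger

JOULES_PER_MWH = 3.6e9
mass_for_one_mwh_kg = JOULES_PER_MWH / energy_per_kg_j

# A fleet of 1000 plungers at 1 MWh each, drained at 50 MW overnight.
night_hours = 12                            # assumption, not from the text
stored_mwh = 1000 * 1.0
used_mwh = 50 * night_hours
remaining_fraction = (stored_mwh - used_mwh) / stored_mwh

# Energy scales with v^2, so speed scales with the square root of energy.
speed_fraction = remaining_fraction ** 0.5
trip_time_factor = 1 / speed_fraction

print(f"energy per kg: {energy_per_kg_j / 1000:.1f} kJ")
print(f"mass for 1 MWh: {mass_for_one_mwh_kg:,.0f} kg")
print(f"energy left at dawn: {remaining_fraction:.0%}")
print(f"slowest trip takes {trip_time_factor:.2f}x the full-speed time")
```

Under these assumptions the slowdown at the very end of the night comes out in the same ballpark as the 50% figure; a shorter night or a heavier fleet shrinks it.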

Heck, the solar panel angle is fun but not really necessary.  From the point of view of the electric power grid, it would be very, very good to have a large energy sink that can even out power fluctuations.    Renewable energy sources often arrive at different times than we'd like to get power out.  Sometimes, like with wind, we get very sharp spikes that we need to even out.  If designed properly, the Hyperloop can absorb pretty much any power spike, and can bleed enough power out to be interesting.  Therefore if the power utilities are smart then they should be willing to pay to add more plungers.  Not because they care about the fact that they are improving peak capacity and reducing waiting time, but because they want to be able to store more power in it.

I'm sure that there are many improvements on this design that I have not thought of but which Elon has.  I'm also sure that Elon has detailed blueprints that take this from a half-baked concept to something you can start to put cost estimates on.  But this idea looks doable to me, and looks like it could - at least in principle - justify all of the claims that Elon has been making for the Hyperloop.

Finally I'd love to see this built.  I'd love to see it built in California.  But, unless someone like Elon pushes it, I'd be willing to bet that the Chinese get it first.

BTW for further discussion, see Hacker News

Monday, October 29, 2012

A/B testing scale cheat sheet

This is not a guide to how to do A/B testing.  If you want that, see Effective A/B Testing, or any number of companies that will help you with A/B testing.  Instead this is a cheat sheet of basic facts on A/B testing (mostly on the scale involved) to help people who are beginning figure out what is feasible.
  • If you've never tested, expect to find a number of 5-20% wins.

    In my experience, most companies find several changes that each add in the neighborhood of 5-20% to the bottom line in their first year of testing.  If you have enough volume to reliably detect wins of this size in a reasonable time frame, you should start now.  If not, then you're not a great candidate for it... yet.
     
  • Experienced testers find smaller wins.

    When you first start testing, you pick up the low-hanging fruit.  The rapid increase in profits can spoil you.  Over time the wins get smaller.  And more specific to your business.  Which means they will take longer to detect.  Expect this.  But if you're still finding an average of one 2% win every month, that is around a 25% improvement in conversion rates per year.
     
  • What works for others likely works for you.  But not always.

    Companies that test a lot have settled on simple value propositions, streamlined signup processes, big calls to action, the same call to action in multiple places, and email marketing.  Those are probably going to be good things for you to do as well.  However Amazon famously relies on customer ratings and reviews.  If you do not have their scale or product mix, you'll likely get very few reviews per product, and may get overwhelmingly negative reviews.  So borrow, but test.
     
  • Your testing methodology needs to be appropriate to your scale and experience.

    A company with 5k prospects per day might like to run a complex multivariate test all of the way to signed up prospect, and be able to find subtle 1% conversion wins.  But they don't generate nearly enough data to do this.  They may need to be satisfied with simple tests on the top step of the conversion funnel.  But it would be a serious mistake for a company like Amazon to settle for such a poor testing methodology.  In general you should use the most sophisticated testing methodology that you generate enough data to carry off.
     
  • Back of the envelope: A 10% win typically needs around 800-1500 successes per version to be seen.

    One of the top questions people have is how long a test takes to run.  Unfortunately it depends on all sorts of things, including your traffic volume, conversion rates, the size of the win to be found, luck, and what confidence you cut the test off at.  But if one version gets to 800 successes when the new one is at 880, you can conclude at a 95% confidence level.  If you wait until you have 1500 versus 1650, you can conclude at a 99% confidence level.  This data point, combined with your knowledge of your business, gives you a starting point for planning.
     
  • Back of the envelope: Sensitivity scales as the square root of the number of observations.

    For example a 5% win takes about 2x as much sensitivity as a 10% win, which means 4x as much data.  So you need 3200-6000 successes per version to see it.
     
  • Data required is roughly linear with number of versions.

    Running more versions requires a bit more data per version to reach confidence.  But not a lot.  Thus the amount of data you need is roughly proportional to the number of versions.  (But if some versions are real dogs, it is OK to randomly move people from those versions to other versions, which speeds up tests with a lot of versions.)  Before considering a complicated multivariate test, you should do a back of the envelope to see if it is feasible for your business.
     
  • Even if you knew the theoretical win, you can't predict how long it will actually take to within a factor of 3.

    An A/B test reaches confidence when the observed difference is bigger than chance alone can plausibly explain.  However your observed difference is the underlying signal plus a chance component.  If the chance component is in the same direction as the underlying signal, the test finishes very fast.  If the chance component is the opposite direction, then you need enough data that the underlying signal overrides the chance signal, and goes on to still be larger than chance could explain.  The difference in time is usually within a factor of 3 either way, but it is entirely luck which direction you get.  (The rough estimates above are not too far from where you've got a 50% chance of having an answer.)
     
  • The lead when you hit confidence is not your real win.

    This is the flip side of the above point.  It matters because someone usually has the thankless task of forecasting growth.  If you saw what looked like an 8% win, the real win could easily be 4%.  Or 12%.  Nailing that number down with any semblance of precision will take a lot more data, which means time and lost sales.  There generally isn't much business value in knowing how much better your better version is, but whoever draws up forecasts will find the lack of a precise answer inconvenient.
     
  • Test early, test often.

    Suppose that you have 3 changes to test.  If you run 3 tests, you can learn 3 facts.  If you run one test with all three changes, you don't know which change actually made a difference.  Small and granular tests therefore do more to sharpen your intuition about what works.
     
  • Testing one step in the conversion funnel is OK only if you're small and just beginning testing.

    Every business has some sort of conversion funnel which potential customers go through.  They see your ad, click on it, click on a registration link, actually sign up, etc.  As a business, you care about people who actually made you money.  Each step loses people.  Generally, whatever version pushes more people through the top step gets more business in the end.  Particularly if it is a big win.  But not always!  Therefore if testing eventual conversions takes you too long, and you're still finding 10%+ wins at the top step in your funnel, it makes business sense to test and run with those preliminary answers.  You'll make some mistakes, but you'll get more right than wrong.  Testing poorly is better than not testing at all.
     
  • People respond to change.

    If you change your email subject lines, people may be 2-5% more likely to click just because it is different, whether or not it is better.  Conversely moving a button on the website may decrease clicks because people don't see the button where they expect it.  If you've progressed to looking for small wins, then you should keep an eye out for tests where this is likely to matter, and try to dig a little deeper on this.
     
  • A/B testing revenue takes more data.  A lot more.

    How much more depends on your business.  But be ready to see data requirements rise by a factor of 10 or more.  Why?  In the majority of companies, a fairly small fraction of customers spend a lot more than average.  The detailed behavior of this subgroup matters a lot to revenue, so you need enough data to average out random fluctuations in this slice of the data.
     
  • Interaction effects are likely ignorable.

    Often people have several things that they would like to test at the same time.  If you have sufficient data, of course, you would like to look at each slice caused by a combination of possible versions separately, and look for interaction effects that might arise with specific combinations.  In my experience, most companies don't have enough volume to do that.  However if you assign people to test versions randomly, and apply common sense to avoid obvious interaction effects (eg red text on red background would cause an interaction effect), then you're probably OK.  Imperfect testing is better than not testing, and the imperfection of proceeding is generally pretty small.
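
The back of the envelope figures in the list above can be reproduced with a rough normal approximation. The sketch below is one way to do that in Python, not necessarily the calculation behind the original numbers. It assumes equal traffic to both versions and low conversion rates, so that each success count is roughly Poisson; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_score(successes_a, successes_b):
    """Approximate z for the difference between two success counts.

    Assumes equal traffic to both versions and low conversion rates, so each
    count is roughly Poisson and the difference has variance a + b.
    """
    return (successes_b - successes_a) / sqrt(successes_a + successes_b)


def two_sided_confidence(z):
    """Two-sided confidence level (0..1) for |z| under a normal curve."""
    return 2 * NormalDist().cdf(abs(z)) - 1


# 10% wins at 800 and 1500 successes, then 5% wins with 4x the data.
for a, b in [(800, 880), (1500, 1650), (3200, 3360), (6000, 6300)]:
    z = z_score(a, b)
    print(f"{a} vs {b}: z = {z:.2f}, confidence = {two_sided_confidence(z):.1%}")
```

Halving the size of the win while quadrupling the data leaves z nearly unchanged, which is the square root rule in action.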
     
As always, feedback is welcome.  I have helped a number of companies in the Los Angeles area on A/B testing, and this tries to answer the most common questions that I've encountered about how much work it is, and what returns they can hope for.

Wednesday, October 17, 2012

My son's flashcard routine

My 7 year old son is in grade 2. In the previous grade, despite his intelligence, he was significantly behind his class in handwriting, letter reversals, and spelling. He was getting extra help from his teacher, but he still had an uphill battle. So I decided to start a flashcard routine to assist. This solved the original problem.  Here is a description of the current routine, and how it has evolved to this point.

It will surprise nobody who has read Teaching Linear Algebra that I started with the thought of some sort of spaced repetition system to maximize his long-term retention with a minimum of effort.  I needed to help him with handwriting, so I wanted to be personally evaluating how he was doing.  This seemed simplest with a manual system.  I therefore settled on a variation of the Leitner system because that is easy to keep track of by hand.

To make things simple for me to track, I am doing things by powers of 2.  Every day we do the whole first pile.  Half of the second.  A quarter of the third.  And so on.  (Currently we top out at a 1/256th pile, but are not yet doing any cards from it.)  Cards that are done correctly move into the next pile. Those that he gets wrong fall into the bad pile, which is the next day's every day pile.
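
For anyone who wants to try the pile scheme on a computer, here is a minimal Python sketch. The routine described here is done by hand with paper cards, so everything about the code is an assumption: the class and method names are made up, cards are taken oldest first within a pile, and a fractional "credit" per pile is carried over between days so that small piles are still reviewed at the right average rate.

```python
from collections import deque


class PowerOfTwoLeitner:
    """Pile k (0-based) has about 1/2**k of its cards reviewed each day."""

    def __init__(self, num_piles=9):
        # piles[0] is the every day pile; piles[8] is the 1/256th pile.
        self.piles = [deque() for _ in range(num_piles)]
        # Fractional reviews owed per pile, carried from day to day.
        self.credit = [0.0] * num_piles

    def add(self, card):
        """New cards start in the every day pile."""
        self.piles[0].append(card)

    def todays_cards(self):
        """Return (pile_index, card) pairs to review today."""
        due = []
        for k, pile in enumerate(self.piles):
            self.credit[k] += len(pile) / 2 ** k
            count = min(int(self.credit[k]), len(pile))
            self.credit[k] -= count
            # The front of each deque holds the cards that have waited longest.
            due.extend((k, pile[i]) for i in range(count))
        return due

    def record(self, pile_index, card, correct):
        """Move a reviewed card up one pile, or back to the every day pile."""
        self.piles[pile_index].remove(card)
        if correct:
            target = min(pile_index + 1, len(self.piles) - 1)
        else:
            target = 0
        self.piles[target].append(card)
```

A day's session would call `todays_cards()` once, then `record(...)` for each card as it is answered. Cards that move during the session are not shown again until the next call.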

So far, so good.  I tried this.  Then quickly found that I did an excellent job of sifting through all of the words he knew and getting the ones he didn't know into the bottom pile.  But he wasn't learning those.  This led to frustration.  Not good.

I then added an extra drill on the pile that he got wrong.  At the end of the session, we do a quick drill with just the problem cards.  Here is the drill until we get to 3 cards.  If he gets the card first try, or gets a card that came all of the way from the bottom since he last got it wrong, it is removed from the drill.  If he gets it wrong, I tell him how to do it, and put it back in the pile near the top so he sees it again soon.  If he gets it right after a recent reminder, it goes to the bottom to get a chance to come out of the drill.

After we get down to 3 cards, I switch the drill up.  If he gets a card wrong I correct him and put it in slot 2.  If he gets it right I put it on the bottom.  Once he gets all three right, I end the drill for that day.

After I added this final drill on the problem cards, the "not learning" problem disappeared.  He began learning, and saw his school performance improve.  His spelling tests went from under half the words correct to the 80-100% range.  Everyone was happy.

It is worth noting that at the end of grade 1 he took several tests, and we found that he was spelling at a grade 3 level.  We have no direct measurement proving it, but I guarantee that he spells even better now.

This happiness lasted until he got used to doing well.  Over time we had more piles.  In school he was being given more words.  I began adding simple arithmetic facts.  This meant more and more work.  Not fun work.  Sometimes he would make a mistake on a card that he had known for a long time.  Then he'd get upset.  Once he got upset he'd get lots of others wrong.  Over the next few days we'd get the cards moving back up the piles, then it would happen again.  The flashcard routine became a point of conflict.

Then I had a great idea (which I borrowed from a speech therapist).  The idea is that I'd mix a reward activity and flashcards.  We'd start on the reward, then do a pile, go back to the reward, then do another pile, go back to the reward, and so on.  The specific reward activity that we're using is that I'm reading books to him that are beyond his current reading level (currently The Black Cauldron), but in principle it could be anything.  With this shift, the motivation problem completely disappeared.  He enjoys the reward.  The flashcards are a minor annoyance that gets him the reward.  If he goes off track, the reward restores his equilibrium.  Intellectually he's happy that he's mastering the material.  But the reward is motivation.

With this fix in place, we lasted several months.  Then we developed an issue.  A couple of words were sufficiently hard that they just stayed in the bad pile every day.  So I made a minor tweak.  I had been doing his top pile, then his next, then his next, on down.  But instead I do his every day pile.  Then go into the top pile, next, next, etc.  But after each of those groups I try him again on the every day words that he hasn't gotten right yet.  Thus he is forced to get his trouble words right 2x per day.  This helped him master them and got them moving back up.

With that fix, we lasted until this week.  This week we had a problem.  His spelling test for this week includes the word embarrassing.  (And he can get a bonus for knowing peculiar.)  The problem is that this word has enough spelling tricks that he simply cannot get it right in one pass.  We tried several times, without success.  I therefore have added flashcards like em(barr)assing for which he gets told, "The word 'embarrassing' starts 'em'.  Write the 'barr' bit."  With these intermediate flashcards he seems to be breaking up learning the whole word into manageable tasks, from which he can learn the word itself.  But I've also generated a ton of temporary flashcards, which may become an issue.  (I plan on removing those piecemeal ones after he successfully gets them in the every 8 day pile.  In a few weeks I'll know how well this is working.)

That brings us to the current state of his flashcard routine.  He currently has hundreds of spelling words and basic arithmetic facts learned.  373 of them learned sufficiently well that he reviews them less than once per month.  But I am sure that I'm not done tweaking.  Here are current issues:
  1. One week is not enough.  Every week he is given a new set of words to master.  But as anyone who has done spaced repetition knows, a week is not very long to master material.  Spaced repetition excels for memorizing a body of data over years, not one week.  On most weeks he is given a set of standard words to learn, and a set of words for bonus points.  With the bonus words he usually gets over 100% on his tests.  But we don't stop, so now he'd do substantially better on last week's test than he actually did last week.
  2. He's only learning what I know that he needs to.  This week I reached out to his teacher and said that I am doing flashcards with him, and looked for feedback on more ways to use them for his benefit.  She pointed out a number of things he can improve on, including common words that he has wrong, grammar, poems he is supposed to memorize, and geography that he is supposed to learn.  The flashcard routine can help with these issues in time, but I had not been aware that he needed it.  Better late than never...
  3. Work is climbing again.  Currently every day I add 2 cards.  Plus every week I add a spelling test of unpredictable size (this week 27, of which he already knew one).  This is increasing the size of the bottom piles, and the work has been increasing.  It is manageable, but I'm keeping my eye on it.
  4. This takes my time.  At the moment that's unavoidable.  One of the issues that we're still working on is handwriting, so there needs to be a human evaluation of what he's doing.  But still I'm taking an hour per day with this.  I think it is an hour well-spent that we both value.  However in a couple of years if his sister needs similar help, what then?  In the long run I'd love to offload the flashcards to a computer program, but the idea of a reward activity has to be in there.  All of the flashcard apps that I've seen assume that doing flashcards is itself a fun activity.  That will not work for my son.  Maybe I'm being too picky.  But I've developed opinions about what works while fine-tuning my son's system.  If there is something that fits that, I'd love to find it.
If you've wound up building a similar or different system to help your children learn, please tell me in the comments.  I've borrowed ideas from all over, and would be happy to try anything reasonable that gets suggested.

Monday, October 8, 2012

How reliable will the Falcon 9 be?

Let's apply statistics to see, based on current launch data, how reliable we predict that the Falcon 9 will be.

Falcon 9 just had a launch that succeeded despite an engine failure.  According to design parameters, it should be able to survive the failure of any two engines.  But the flight can be lost if we lose 3+ engines.  Exactly how reliable is the Falcon 9 design?

Let me first take a naive approach.  To date we've had 4 launches of the Falcon 9, each with 9 engines (that's the 9 in Falcon 9), and have seen one in-flight engine failure.  The measured success rate of an engine is therefore 35/36.  With that in mind, we can produce the following figures.
  • Probability of no engine failures: (35/36)**9 * (1 - 35/36)**0 * (9 choose 0) = (35/36)**9 = 77.6%
  • Probability of 1 engine failure: (35/36)**8 * (1 - 35/36)**1 * (9 choose 1) = (35/36)**8 * (1/36) * 9 = 20.0%
  • Probability of 2 engine failures: (35/36)**7 * (1 - 35/36)**2 * (9 choose 2) = (35/36)**7 * (1/36)**2 * 36 = 2.3%
  • Probability of 3+ engine failures: 1 - above probabilities = 0.2% (actually 0.16%)
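
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the binomial calculation above.  The function names are made up for illustration, and it assumes that engine failures are independent and that every engine has the same failure rate.

```python
from math import comb

def prob_engine_failures(k, n=9, p_fail=1/36):
    """Binomial probability of exactly k failures among n independent engines."""
    return comb(n, k) * p_fail**k * (1 - p_fail)**(n - k)

def prob_launch_loss(p_fail, n=9, tolerated=2):
    """Probability that more than `tolerated` of the n engines fail."""
    survivable = sum(prob_engine_failures(k, n, p_fail) for k in range(tolerated + 1))
    return 1 - survivable

if __name__ == "__main__":
    for k in range(3):
        print(f"{k} engine failures: {prob_engine_failures(k):.1%}")
    print(f"3+ engine failures: {prob_launch_loss(1/36):.2%}")
```

Because the per-engine failure rate is a parameter, the same function can be rerun with any other assumed rate.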

For comparison the US Space Shuttle had a failure rate of 2/135 which is about 1.5%.

So SpaceX flights are dangerous compared to most things that we do, but so far seem much more reliable than any previous mode of transport into space, including the US Space Shuttle, which was previously the most reliable.  (Not the safest though!  Soyuz has that record because, unlike the Space Shuttle, it has demonstrated the ability to have passengers survive a catastrophic failure that aborted the mission.)

But is that the end of the story?  No!

Suppose that the true failure rate of each individual engine is actually 10%.  Then an exactly parallel calculation to the above will find that the failure rate of a rocket launch is 5.3%.  That doesn't sound very reliable!

However is it reasonable to think that 10% is a likely failure rate for each engine?  Well suppose that before we had seen any launches we thought that a 10% failure rate was equally likely as a failure rate of 1/36.  Our observation is 1 engine failure out of 36.  The odds of that exact observation with a 10% failure rate are 9.0%.  The odds of that observation with a failure rate of 1/36 are 37.3%.  According to Bayes' theorem, the probabilities that we give to theories after making an observation should be proportional to our initial belief of the probability of that theory times the probability of the given observation under that theory.

That is a mouthful.  Let's look at numbers.  In this hypothetical scenario our initial belief was a 50% chance of a 10% failure rate, and a 50% chance of a failure rate of 1/36.  After observing 36 instances of engines lifting off with 1 failure, the 10% theory has probability proportional to 4.5%, while the 1/36 theory has probability proportional to 18.65%.  Thus our updated belief is that the 10% theory has probability 4.5/(4.5 + 18.65) = 0.194, or about 19%.  (Without the intermediate rounding we'd actually be at 0.195.)  And the 1/36 theory has probability around 81%.  Then combining the predictions of the theories with the probability assigned to each theory we get an estimated launch failure rate of 0.053 * 0.195 + 0.0016 * 0.805 = 0.0116 = 1.16%.  Our confidence in the record put up by the Falcon 9 is not as good now!
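
The two-theory update can be sketched in a few lines of Python.  This is only an illustration under the assumptions stated above (a 50/50 prior, independent engines, and a launch lost when 3 or more of 9 engines fail), and the variable and function names are my own.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k failures in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def launch_loss(p, engines=9, tolerated=2):
    """Probability that more than `tolerated` engines fail on one launch."""
    return 1 - sum(binom_pmf(k, engines, p) for k in range(tolerated + 1))

theories = [0.10, 1/36]   # candidate per-engine failure rates
prior = [0.5, 0.5]        # assumed equally likely before any launches

# Bayes' theorem: posterior is proportional to prior times likelihood
# of the observation (1 failure among 36 engines).
weights = [pr * binom_pmf(1, 36, p) for pr, p in zip(prior, theories)]
total = sum(weights)
posterior = [w / total for w in weights]

# Blend each theory's launch-loss prediction by its posterior probability.
estimate = sum(post * launch_loss(p) for post, p in zip(posterior, theories))

if __name__ == "__main__":
    print(f"posterior for 10% theory: {posterior[0]:.3f}")
    print(f"estimated launch failure rate: {estimate:.2%}")
```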

Please note the following characteristics of this analysis:
  1. Observations do not tell us what reality is, they update our models of reality.
  2. A wide range of failure probabilities fit the limited observations that we have so far on the Falcon 9.
  3. With enough data, theories that are far away from the observed average become very unlikely.
Now a curious person might want to know what the odds of failure would be if we included more possible prior theories.  I whipped up a quick Perl script to do the calculation for an initial expectation that 0.00%, 0.01%, 0.02%, ..., 99.99%, 100% were all equally likely failure rates a priori.  When I run that script I get a probability of 0.0198180199757443, which is an estimated failure rate of about 2%.  If you start with different beliefs, you can generate very different specific numbers.  As an extreme example, if you believe that SpaceX is constantly improving, so their future engines are likely to be more reliable than their past ones, then ridiculously good numbers become very plausible.
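
The following is not the Perl script mentioned above, just a hedged Python sketch of the same kind of calculation.  It assumes a uniform prior over the 10,001 failure rates from 0.00% to 100.00%, the same observation of 1 failure among 36 engines, and a launch lost when 3 or more of 9 engines fail.  All names in it are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k failures in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def launch_loss(p, engines=9, tolerated=2):
    """Probability that more than `tolerated` engines fail on one launch."""
    return 1 - sum(binom_pmf(k, engines, p) for k in range(tolerated + 1))

# Candidate per-engine failure rates: 0.00%, 0.01%, ..., 100.00%.
rates = [i / 10000 for i in range(10001)]

# With a uniform prior the prior term cancels, so each theory's weight is
# just the likelihood of seeing exactly 1 failure among 36 engines.
weights = [binom_pmf(1, 36, p) for p in rates]
total = sum(weights)

# Posterior-weighted average of each theory's predicted launch failure rate.
estimate = sum(w * launch_loss(p) for w, p in zip(weights, rates)) / total

if __name__ == "__main__":
    print(f"estimated launch failure rate: {estimate:.4f}")
```

Swapping in a non-uniform list of prior weights is a one-line change, which makes it easy to see how much the answer depends on the starting beliefs.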

However the bottom line is that the data we have so far is not yet good evidence that the Falcon 9 will actually put up a better reliability record over its lifetime than previous space vehicles.