Learning from uncooperative A/B testers

One of the joys of working at a tiny startup packed into an ill-equipped, too-small space was running an account at Khaladi Brothers, the coffee shop across the street, because all small meetings had to be done outside the office. As the top coffee nerd, I took on running fresh vacuum pots over (and yelling “Fresh pot!” as I entered) and exchanging the empty ones. When we moved to a newer, spacious, swanky, and quite expensive office space (hot tip to startups: don’t do this) with an actual kitchen and drip coffee maker, I was put in charge of deciding which coffee beans we’d order. We had many options and an office of high-volume consumers with strong opinions on everything, and needed to get down to one or two for the recurring bulk order.

Naturally, as a Product Manager, I decided to do selection through a series of A/B tests.

Must-haves for the tests:

  • end up with a winner
  • clear methodology, publicly exposed
  • easy to participate — or not
  • takes as little of my time as possible, because this was amusing but everything’s on fire all the time
  • keep momentum out of it (so the first voter didn’t disproportionately determine the day’s winner)

I discarded forced-choice, so coffee drinkers didn’t have to vote if they didn’t feel like trying both or didn’t find a winner. I decided against setting up a dedicated “today’s test” table or doing “three-sample can you tell if one’s different” type advanced testing. I didn’t try to timestamp votes to determine if one did well fresh and one did well through the day… nope!

I went straight single-bracket, winner-advances, random seeding at each round. Every day I tried to get to the office before everyone, and made two giant pots of coffee labelled “A” and “B”. If someone wanted to vote for a winner, they could write it down and drop it in a tin, which I tallied at the end of the day. I will admit that having come out of Expedia, where our A/B tests were at colossal scale with live customers, this whole thing seemed trivial and I didn’t spend as much time as I might have.
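For anyone who wants to try the same format, here is a minimal sketch of a single bracket with a fresh random draw each round. It is not what I ran (I used paper and a tin); the names `run_bracket` and `pick_winner` are made up for illustration, and giving the odd coffee out a bye is an assumption about how to handle an uneven field.

```python
import random


def run_bracket(coffees, pick_winner, rng=None):
    """Single-elimination bracket, re-seeded at random every round.

    coffees      -- list of coffee names (hypothetical input)
    pick_winner  -- callable taking two names and returning the day's winner,
                    e.g. by tallying the ballots in the tin
    rng          -- optional random.Random, so a draw can be reproduced
    """
    if not coffees:
        raise ValueError("need at least one coffee")
    rng = rng or random.Random()
    remaining = list(coffees)
    while len(remaining) > 1:
        rng.shuffle(remaining)  # random seeding for this round
        next_round = []
        # Pair neighbours off: (0, 1), (2, 3), ...
        for i in range(0, len(remaining) - 1, 2):
            next_round.append(pick_winner(remaining[i], remaining[i + 1]))
        # Assumption: with an odd field, the last coffee drawn gets a bye.
        if len(remaining) % 2 == 1:
            next_round.append(remaining[-1])
        remaining = next_round
    return remaining[0]
```

Each call to `pick_winner` is one day of two pots, so the number of mornings you sign up for is roughly the number of coffees minus one.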

You may already see where some of this is going. “I know! I too am from the future,” as Mike Birbiglia says.

It was not trivial, and I ended up learning from the experience.

Test assumptions, set baselines: I didn’t have 32 coffees, which was good, because some days I did an A/A test (the same coffee in both pots) to see what the difference would be. I was surprised: on those days voting for winners was down, and results were remarkably close to 50%/50%. The highest split was 59% (10/17), which was a vote off a straight split.
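To put a number on how unremarkable that 10/17 day is, here is a rough sketch that treats every vote as a fair coin flip, which is what an A/A day should look like if nobody can tell the pots apart. The function name is hypothetical, and the coin-flip model is an assumption, not something I checked at the time.

```python
from math import comb


def prob_split_at_least(n_votes, top_votes):
    """Chance that the more popular pot gets at least top_votes out of
    n_votes when every vote is a 50/50 coin flip (either pot may lead)."""
    if top_votes * 2 <= n_votes:
        return 1.0  # some pot always reaches at least half the votes
    # Ways for one particular pot to collect top_votes or more.
    tail = sum(comb(n_votes, k) for k in range(top_votes, n_votes + 1))
    # Doubled because either pot could be the one ahead.
    return 2 * tail / 2 ** n_votes


print(round(prob_split_at_least(17, 10), 2))  # prints 0.63
```

Under that assumption, a split at least as lopsided as 10/7 turns up on roughly six days out of ten, so the worst A/A day was well within what chance alone produces.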

Know that blind tests mean subjects may reject results or: Starbucks did really well. I don’t know what to say. I figured they’d barely beat out the clearly awful generic ones and Tully’s, but some of their whole beans did well and got all the way to the semi-finals. Participants were not happy to learn this, came by to ask questions, and generally were reluctant to accept that they’d preferred it. If a Starbucks bean had won but it had made people unhappy, would I have gone through with ordering it? I’m glad I didn’t have to confront that.

Also… yeah, Seattleites have issues with Starbucks.

Consider the potential cost of testing itself. The relatively small amount of time I thought it would take each day turned into way more effort than I’d hoped. Doing testing in public is a colossal hassle. Even having told everyone how I was doing it, during the month this went on, there were those offering constant feedback:

  • it should be double-blind so I don’t know which pot is which
  • it should have three pots, and they might all be the same, or different
  • no, they’re wrong…
  • it’s easy to come in early and see which one I’m making

…and so on. By week two, getting up early to make two pots of coffee as someone offered methodological criticism was an A/B trial of my patience.

If testers can tamper, they will. How will you deal with it? For one example, I came into the kitchen one day to get a refill and a developer was telling everyone he knew which pot was which: he’d seen me brewing, had poured an early cup off one batch, and so knew the pot with the lower level indicator was that batch. He was clearly delighted to tell everyone drinking coffee which one they’d picked. I honored the day’s results anyway.

This kind of thing happened all the time. At one point I was making the coffee in a conference room to keep the day’s coffees concealed. In a conference room! Like a barbarian!

I was reminded of the perils of pricing A/B experiments, which Amazon was being called out for at the time — if customers know they might be part of a test and start clearing their browser cookies and trying to get into the right bucket, how does that skew the results? “People who reloaded the page over four times converted at a much higher rate… we should encourage refreshing!”

Think through potential “margin of error” decisions when structuring tests. There was a coffee I liked that dominated early rounds and then in the semi-finals lost by two votes to a coffee that had squeaked by in previous rounds by 1-2 votes each time. What should I have done in cases where the vote was so close? I’d decided the winner by any margin would advance, but was that the way it should have been? Should I have had a loser bracket?
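One way to think about "how close is too close" before the bracket starts: work out how many votes the leader needs, at a given turnout, before a coin-flip electorate would rarely produce that result. This is only a sketch. The function name is made up, the 5% cutoff is a conventional assumption rather than anything I used, and I never recorded turnout for that semi-final, so the 17 below is borrowed from the A/A day.

```python
from math import comb


def min_decisive_votes(n_votes, alpha=0.05):
    """Smallest vote count for the leading pot such that a result at least
    that lopsided has probability <= alpha under 50/50 coin-flip voting.
    Returns None if no result at this turnout is that unlikely."""
    for top_votes in range(n_votes // 2 + 1, n_votes + 1):
        tail = sum(comb(n_votes, k) for k in range(top_votes, n_votes + 1))
        # Doubled because either pot could be the one ahead.
        if 2 * tail / 2 ** n_votes <= alpha:
            return top_votes
    return None


print(min_decisive_votes(17))  # prints 13
```

At 17 votes the leader would need 13 (a 13 to 4 day) to clear that bar, so a two-vote win says very little either way. Whether the answer is a rematch, a loser bracket, or just accepting the noise is still a judgment call.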

In the end, we had a winner, and it was quite good, and far better than what the default choice would have been, but I was left unsatisfied. I’d met the requirements for the test, and it’d been a pain in the ass for me but hadn’t taken that much time. I couldn’t help but think, though, that if I’d just set up a giant tasting session for anyone who cared, and let them vote all at once, I’d have saved everyone a lot of trouble and possibly had a better result.

But more importantly, like every other time I’ve done A/B testing in my product management career, the time I spent on the test and in thinking through its implications and the process helped me in every subsequent test, and was well worth it. I encourage everyone to find places to do this kind of lightweight learning. Surely there are dog owners out there wondering what treats are best, and dogs who would be happy to participate (and cheat, if you’re not wary).

Go forth!
