Learning from uncooperative A/B testers

One of the joys of working at a tiny startup packed into an ill-equipped, too-small space was running an account at Khaladi Brothers, the coffee shop across the street, because all small meetings had to be done outside the office. As the top coffee nerd, I took on running fresh vacuum pots over (and yelling “Fresh pot!” as I entered) and exchanging the empty ones. When we moved to a newer, spacious, swanky, and quite expensive office space (hot tip to startups: don’t do this) with an actual kitchen and drip coffee maker, I was put in charge of deciding which coffee beans we’d order. We had many options and an office of high-volume consumers with strong opinions on everything, and needed to get down to one or two for the recurring bulk order.

Naturally, as a Product Manager, I decided to do selection through a series of A/B tests.

Must-haves for the tests:

end up with a winner
clear methodology, publicly exposed
easy to participate, or not
takes as little of my time as possible, because this was amusing but everything’s on fire all the time
keep momentum out of it (so the first voter didn’t disproportionately determine the day’s winner)

I discarded forced-choice, so coffee drinkers didn’t have to vote if they didn’t feel like trying both or didn’t find a winner. I decided against setting up a dedicated “today’s test” table or doing “three-sample, can you tell if one’s different” type advanced testing. I didn’t try to timestamp votes to determine if one did well fresh and one did well through the day… nope!

I went straight single-bracket, winner-advances, random seeding at each round. Every day I tried to get to the office before everyone, and made two giant pots of coffee labelled “A” and “B”. If someone wanted to vote for a winner, they could write it down and drop it in a tin, which I tallied at the end of the day. I will admit that having come out of Expedia, where our A/B tests were at colossal scale with live customers, this whole thing seemed trivial and I didn’t spend as much time as I might have.
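For anyone who wants to try the same format, here is a minimal Python sketch of a single-elimination bracket that reshuffles the field every round. The names `run_bracket` and `pick_winner` are made up for illustration, and giving the odd coffee out a bye is an assumption; the post doesn't say how uneven rounds were handled. In the office version, `pick_winner` was the day's tally from the tin.

```python
import random

def run_bracket(coffees, pick_winner, rng=None):
    """Single-elimination bracket with a fresh random draw each round.

    pick_winner(a, b) stands in for one day's vote and returns the
    coffee that advances (any margin wins, as in the post).
    """
    if not coffees:
        raise ValueError("need at least one coffee")
    rng = rng or random.Random()
    remaining = list(coffees)
    while len(remaining) > 1:
        rng.shuffle(remaining)  # random seeding at each round
        next_round = []
        # Assumption: with an odd field, one coffee gets a bye.
        if len(remaining) % 2 == 1:
            next_round.append(remaining.pop())
        # Pair off the rest; each pairing is one day's A/B test.
        for a, b in zip(remaining[0::2], remaining[1::2]):
            next_round.append(pick_winner(a, b))
        remaining = next_round
    return remaining[0]
```

Passing something deterministic as `pick_winner` (for example `max`) is a quick way to check the plumbing before real votes are involved.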

You may already see where some of this is going. “I know! I too am from the future,” as Mike Birbiglia says.

It was not trivial, and I ended up learning from the experience.

Test assumptions, set baselines: I didn’t have 32 coffees, which was good, because some days I did an A/A test to see what the difference would be. I was surprised: on those days voting for winners was down, and results were remarkably close to 50%/50%. The highest split was 59% (10/17), which was a vote off the closest split 17 votes allow (9 to 8).
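To put a number on how unremarkable that 10-of-17 day is, here is a rough Python sketch that treats every vote as an independent coin flip (an assumption, and a generous one for an office where people talk) and computes how often pure chance produces a split at least that lopsided. The function name `split_probability` is hypothetical.

```python
from math import comb

def split_probability(votes_for_leader, total_votes):
    """Chance that fair coin-flip voting gives a split at least this lopsided.

    Exact two-sided binomial tail with p = 0.5.
    """
    n = total_votes
    # Measure from the leading pot's side, whichever pot that is.
    k = max(votes_for_leader, n - votes_for_leader)
    # Probability that one particular pot gets k or more of the n votes.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Either pot could have been the leader, so double it (capped at 1).
    return min(1.0, 2 * tail)

print(round(split_probability(10, 17), 3))  # 0.629
```

Under that assumption, identical pots would produce a 10-to-7 split or worse well over half the time, which is a useful baseline to keep in mind when a real A/B day comes back just as close.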

Know that blind tests mean subjects may reject results, or: Starbucks did really well. I don’t know what to say. I figured they’d barely beat out the clearly awful generic ones and Tully’s, but some of their whole beans did well and got all the way to the semi-finals. Participants were not happy to learn this, came by to ask questions, and generally were reluctant to accept that they’d preferred it. If a Starbucks bean had won but it had made people unhappy, would I have gone through with ordering it? I’m glad I didn’t have to confront that.

Also… yeah, Seattleites have issues with Starbucks.

Consider the potential cost of testing itself. The relatively small amount of time I thought it would take each day turned into way more effort than I’d hoped. Doing testing in public is a colossal hassle. Even having told everyone how I was doing it, during the month this went on, there were those offering constant feedback:

it should be double-blind so I don’t know which pot is which
it should have three pots, and they might all be the same, or different
no, they’re wrong…
it’s easy to come in early and see which one I’m making

…and so on. By week two, getting up early to make two pots of coffee as someone offered methodological criticism was an A/B trial of my patience.

If testers can tamper, they will. How will you deal with it? For one example, I came into the kitchen one day to get a refill and a developer was telling everyone he knew which pot was which because he’d seen me brewing and had poured an early cup off one of them, and so knew the pot with the lower level indicator was that batch. He was clearly delighted to tell everyone drinking coffee which one they’d picked. I honored the day’s results anyway.

This kind of thing happened all the time. At one point I was making the coffee in a conference room to keep the day’s coffees concealed. In a conference room! Like a barbarian!

I was reminded of the perils of pricing A/B experiments, which Amazon was being called out for at the time: if customers know they might be part of a test and start clearing their browser cookies and trying to get into the right bucket, how does that skew the results? “People who reloaded the page over four times converted at a much higher rate… we should encourage refreshing!”

Think through potential “margin of error” decisions when structuring tests. There was a coffee I liked that dominated early rounds and then in the semi-finals lost by two votes to a coffee that had squeaked by in previous rounds by 1-2 votes each time. What should I have done in cases where the vote was so close? I’d decided the winner by any margin would advance, but was that the way it should have been? Should I have had a loser bracket?
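One way to decide this up front is to fix, before any coffee is poured, how big a lead has to be before it counts. The Python sketch below is one possible rule, not what was done here: it finds the smallest lead that coin-flip voting would produce less than `alpha` of the time. The name `min_decisive_margin`, the 0.05 threshold, and the coin-flip model are all assumptions for illustration.

```python
from math import comb

def min_decisive_margin(total_votes, alpha=0.05):
    """Smallest lead (leader minus trailer) that chance alone rarely produces.

    Uses an exact two-sided binomial test against 50/50 voting.
    Returns None when even a unanimous vote would not clear alpha.
    """
    n = total_votes
    # Walk the leader's vote count up from the closest possible split.
    for k in range((n + 1) // 2, n + 1):
        tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        if 2 * tail < alpha:
            return k - (n - k)
    return None

print(min_decisive_margin(17))  # 9, i.e. a 13-to-4 day
```

With turnout in the range of the 17-vote day mentioned earlier, a one- or two-vote win is nowhere near that bar, so a rule like this would send most close rounds to a rematch or a loser bracket. Whether that is worth another week of early mornings is a separate question.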

In the end, we had a winner, and it was quite good, far better than what the default choice would have been, but I was left unsatisfied. I’d met the requirements for the test; it’d been a pain in the ass for me but hadn’t taken that much time. I couldn’t help but think, though, that if I’d just set up a giant tasting session for anyone who cared, and let them vote all at once, I’d have saved everyone a lot of trouble and possibly had a better result.

But more importantly, like every other time I’ve done A/B testing in my product management career, the time I spent on the test and in thinking through its implications and the process helped me in every subsequent test, and was well worth it. I encourage everyone to find places to do this kind of lightweight learning. Surely there are dog owners out there wondering what treats are best, and dogs who would be happy to participate (and cheat, if you’re not wary).

Go forth!

Hate Life, Will Travel

Occasional musings of Derek Zumsteg
