Online Controlled Experimentation (also known as A/B Testing) is a subject I will revisit often, as I am Test Manager for Microsoft's Experimentation Platform (ExP), which enables properties at Microsoft to conduct such testing. We even have a shiny logo:
To produce software that makes our customers giddy we need to understand what those rascally customers want. To that end, we can do the following:
The way of watching our users' behavior that I want to expound upon here, however, is Online Experimentation.
…well probably…assuming you use one of Microsoft's several web sites or Amazon.com or eBay or Google. Online experimentation is a simple concept. All users request a specific web page, but behind the scenes some percentage of users are shown a different page than the others. Users do not know they are in an experiment; all they know is that they asked for a web address (URL) and were served a page. The point is that some users might see the page as it's been for months (the control), while some might see the page with some new feature(s) on it (the treatment). We can then measure what users actually do on these respective pages to find out which one is "better". This diagram offers a simplistic view of it:
I say it's simplistic because there can be more than just two variations of the page, and the users need not be split 50/50. We can have two variants split 90/10, or four variants split 10/10/40/40, etc.
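To make the splitting idea concrete, here is a minimal sketch of one common way to assign users to variants: hash the user ID into one of 100 buckets and map bucket ranges to variants by weight. This is an illustration only, not a description of how ExP does it; the function name `assign_variant`, the experiment label, and the choice of SHA-256 are all my own assumptions.

```python
import hashlib

def assign_variant(user_id, experiment, weights):
    """Deterministically map a user to a variant.

    weights is a list of (variant_name, percent) pairs that must sum to 100.
    Hypothetical helper for illustration; not ExP's actual mechanism.
    """
    if sum(percent for _, percent in weights) != 100:
        raise ValueError("weights must sum to 100")
    # Hash experiment + user so the same user always lands in the same bucket
    # for this experiment, but may land elsewhere in a different experiment.
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100  # 0..99
    cumulative = 0
    for name, percent in weights:
        cumulative += percent
        if bucket < cumulative:
            return name
    raise AssertionError("unreachable when weights sum to 100")

# A 90/10 split between the control and one treatment.
split = [("control", 90), ("treatment", 10)]
print(assign_variant("user-12345", "homepage-test", split))
```

Because the assignment depends only on the user ID and the experiment label, a returning user keeps seeing the same variant without any stored state. A four-way 10/10/40/40 split is just a longer weights list.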
The point of experimentation is to determine which implementation is "better", but what do we mean by better? First we must look at what we can measure, and then decide which of those measurements captures "betterness" (or “betterosity” if you prefer).
As for what we are measuring, ExP is a highly flexible, highly customizable system that can measure almost every client-side or server-side event, from clicks to page views to page requests to even dwell times, and then slice and dice these per user, per hour, per day, or per session. On the other hand, a service like Google Website Optimizer is free and widely available but much more limited: it offers a fixed set of observations focused on the click-through rate (clicks per page view).
As for how we define better for ExP experiments, although we collect many different metrics, we define one as the Overall Evaluation Criterion (OEC). It is this metric that determines whether the treatment is a success over the control. In picking the right OEC, you want a strong business driver. Revenue is often a good OEC, but remember also to tie it to the long-term goals of your website, such as customer lifetime value. For a retail site, you can measure sales per user and see if there is an increase (and this would be a good OEC), but conversions per new user (i.e. new users who made a purchase) may be more interesting for the long run of your business.
Keep in mind that just because you have decided what to measure, are measuring it, and see that on average B > A, this does not mean you actually have a real difference (that is to say, something actionable). The Experimentation Platform employs advanced statistical methods to tell you whether your results are statistically significant. The alternative is that the observed difference in average values between A and B is due to chance. We generally consider the results statistically significant if there is a 5% or lower probability that a difference this large could have occurred by chance alone, or put another way, that we have 95% or higher confidence in the results. Of course higher than 95% is better, and is often seen in ExP experiments.
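For a feel of what such a significance check looks like, here is a rough sketch using a textbook two-proportion z-test on conversion counts. To be clear, this is not ExP's method (I have not described which statistics ExP uses here); the function name and the numbers in the usage line are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between A and B.

    conv_* are counts of converting users, n_* are total users per variant.
    Returns (z, p_value). Textbook test, shown as an illustration only.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that A and B truly do not differ.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical numbers: 10.0% conversion in A vs 11.0% in B, 10,000 users each.
z, p = two_proportion_z_test(1000, 10000, 1100, 10000)
print(f"z = {z:.2f}, p = {p:.4f}, significant at 5%: {p < 0.05}")
```

A p-value at or below 0.05 corresponds to the 5% threshold mentioned above. With far fewer users, the same one-point difference in rates would usually not clear that bar, which is why seeing B > A on average is not enough on its own.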
So why go through this trouble? Why not simply take down the old website, put up the new one, and then do all the same measurements of user behavior and see if there are any differences? Because then your data will be worthless: you are no longer controlling for the effect of things other than the variant or feature you want to test. What if Oprah mentions your website sometime around the transition period, or an ad campaign starts running, or you have network problems, or any of thousands of other events occur that can affect user behavior and your user traffic? You have no way of determining how much of the effect was caused by your change from website A to B and how much was caused by those other events. That is why only Controlled Experiments are useful.
Here is an example of an experiment that was run. The MSN Real Estate site wanted to test different designs for their “Find a home” widget. Visitors to this widget were sent to Microsoft partner sites, from which MSN Real Estate earns a referral fee. Six different designs, including the incumbent, were tested. Can you guess which one was the winner?
The winner, Treatment 5, increased revenues from referrals by almost 10% (due to increased clickthrough). The Return-On-Investment (ROI) was phenomenal [Kohavi et al. 2009].
Why did Treatment 5 win? Well, experimentation tells you what happens. The why is not always clear, but after running many experiments on many sites certain trends do emerge. One is that tabbed designs (Treatments 1, 2, 3, and 4) tend to test poorly. They present a clean layout but add extra friction (clicking through the tabs) between the user and the functionality. Another is that a single input field tends to win over multiple input fields.
Google's top designer Doug Bowman resigned his post, claiming that too many design decisions were made based on experimentation and that "data eventually becomes a crutch for every decision". He specifically decried "that a team at Google couldn’t decide between two blues, so they’re testing 41 shades between each blue to see which one performs better. I had a recent debate over whether a border should be 3, 4 or 5 pixels wide". It sounds to me like Google needs some help choosing their OEC. Perhaps it was not experimentation that drove Doug away, but poor experiment design.
Some in software have claimed that experimentation is good for incremental feedback, but not so good for (or actually prevents) the new and innovative. Again, I think this goes to the experiment design. You can design incremental experiments and realize value from them, or you can design bold experiments and leverage those to make large innovative changes. Experimentation actually drives innovation by allowing you to try many things in the real world and see what works, rather than building up a huge feature set in the isolation of development, releasing only every 6 months or 1 year, and then finding that half the features were wasted work (or worse, negative contributors).
On the ExP team our mascot is the HiPPO, which stands for "Highest Paid Person's Opinion". In the time before Experimentation, in a land without data, this was the person who decided what got deployed and what did not. Beware of HiPPOs: they kill innovative ideas.