In our online display advertising targeting business, we have built a unified experiment platform to serve different experimentation needs, especially around our audience models. We have thousands of audience models; each is scored daily and consumed by the delivery engine, so the quality of a model directly determines its performance. To improve model quality, the experiment platform provides both offline and online ways to accelerate experimentation (as the picture below shows). Offline experiments are conducted mainly through historical evaluation, while online experiments run through live A/B testing.
After we build the models, we can evaluate them against the historical performance data of selected ads. Ad selection matters because different ads produce very different performance data, and a model's performance varies sharply from one ad to another. The data analyst responsible for training the model and verifying its quality can choose which KPIs are used in the evaluation; the most commonly used KPIs include CTR, Reach, Targetable Impression, Demographics, and Behaviors. Currently we have no overall evaluation criterion for historical evaluation, so it cannot automatically conclude whether a model is good or bad. Still, the detailed KPIs help analysts deeply understand the models they have created.
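To make this concrete, here is a minimal sketch of how per-ad KPIs such as CTR and reach could be computed from historical logs. The pandas-based code and the log schema (`ad_id`, `user_id`, `impressions`, `clicks`, `in_audience`) are hypothetical simplifications for illustration, not our platform's actual data model.

```python
import pandas as pd

# Hypothetical log schema: one row per (ad, user) with impression/click
# counts and whether the model scored the user as in-audience.
def evaluate_model(logs: pd.DataFrame) -> pd.DataFrame:
    """Compute per-ad CTR and reach for users the model scored in-audience."""
    targeted = logs[logs["in_audience"]]
    kpis = targeted.groupby("ad_id").agg(
        impressions=("impressions", "sum"),
        clicks=("clicks", "sum"),
        reach=("user_id", "nunique"),  # distinct users reached
    )
    kpis["ctr"] = kpis["clicks"] / kpis["impressions"]
    return kpis
```

An analyst could run this over the historical data of each selected ad and compare the resulting KPI table across candidate models.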
Once the historical evaluation results are in, some models are ready to go live. But to reduce the chance of inconsistent performance between offline and online, and to roll the model out more smoothly, we first run a live A/B test between the current live model and the new model that has passed historical evaluation.
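One common way to split traffic for such a test (an assumption here, not necessarily our platform's exact mechanism) is deterministic hash-based bucketing, so each user consistently sees the same model for the duration of the experiment:

```python
import hashlib

# Hash each user into a stable bucket; the experiment name and the
# 50/50 split below are illustrative assumptions.
def assign_variant(user_id: str, experiment: str = "model_ab_test") -> str:
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "model_a" if bucket < 50 else "model_b"  # current live vs. new model
```

Because the assignment depends only on the user ID and experiment name, it needs no shared state and gives each user a consistent experience across requests.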
The live A/B test itself is quite straightforward: compare the performance (CTR) of model A against model B over a continuous period. If the result is statistically significant, the winning model is published and goes live immediately.
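As a sketch of the significance check, assuming a two-sided two-proportion z-test on CTR (a standard choice for click-through comparisons; the counts below are purely illustrative):

```python
from statistics import NormalDist

def ctr_z_test(clicks_a, imps_a, clicks_b, imps_b, alpha=0.05):
    """Two-sided two-proportion z-test comparing the CTRs of two models."""
    p_a, p_b = clicks_a / imps_a, clicks_b / imps_b
    p_pool = (clicks_a + clicks_b) / (imps_a + imps_b)  # pooled CTR under H0
    se = (p_pool * (1 - p_pool) * (1 / imps_a + 1 / imps_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return z, p_value, p_value < alpha

# e.g. model B gets 1,300 clicks on 100,000 impressions vs. 1,200 for model A
z, p, significant = ctr_z_test(1200, 100_000, 1300, 100_000)
```

With these illustrative numbers, the p-value comes out around 0.044, so at the usual 0.05 level the difference would count as statistically significant and model B would be published.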
Every tool in the experiment platform involves some statistics, and I will take live A/B testing as the example to blog about the introductory statistics behind it.