Recently I worked on testing classifiers at my job. As preparation I did some reading on how to evaluate classifier performance, and here are my thoughts... Classification is the task of assigning objects to one of several predefined categories. In most cases the classifier is supervised, in the sense that human supervision is required to train it using data that has already been categorized. Training is the process of feeding the classifier data whose categories are known so that it induces a model. The classifier then uses this induced model to determine the categories of new objects.

Given two classifiers, how would we compare them? Which one performs better? The performance of a classifier can be defined as a measure of its ability to correctly identify the category of new objects. To measure this, we can take pre-classified data, run it through the classifier, and compare the predicted categories with the known ones.
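
As a rough illustration of "match the results", the sketch below computes accuracy as the fraction of objects whose predicted category equals the known one. The function name `accuracy` and the sample labels are made up for this example; they are not from any particular library.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of objects whose predicted category matches the known one."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one labeled object")
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / len(true_labels)


# Hypothetical pre-classified labels and the categories a classifier assigned.
known = ["spam", "ham", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "ham"]

print(accuracy(known, predicted))  # 4 of the 5 predictions match
```

Comparing two classifiers then comes down to computing this number for each of them on the same pre-classified data.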


The following are some useful ways to evaluate the performance of a classifier.

Hold Out Method

In this method the set of labeled data is split into two groups. One is used to train the classifier while the other is used to test it. There is no fixed rule for the proportion of data assigned to each group. The split has to leave enough training data to induce a good model and enough test data to give a trustworthy accuracy estimate.

The first drawback is that not all of the available data is used for training. Second, if we use a large training set, the small test set that remains gives an accuracy estimate we cannot be very sure of. On the other hand, if we use a small training set, the induced model might have a lot of variance. Finally, the training and test sets are not independent of each other: because both are carved out of the same data, a class that is overrepresented in one set is underrepresented in the other, and vice versa.
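
A minimal sketch of a hold out split, assuming the labeled records fit in a list and that a random shuffle is an acceptable way to choose the two groups. The function name `hold_out_split`, the default 30% test fraction, and the seed are assumptions made for the example, not recommended values.

```python
import random


def hold_out_split(records, test_fraction=0.3, seed=0):
    """Shuffle the labeled records and split them into (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Ten hypothetical record ids; three end up in the test set.
train, test = hold_out_split(range(10), test_fraction=0.3)
print(len(train), len(test))
```

Every record lands in exactly one of the two groups, which is why a class that happens to be overrepresented in one group is underrepresented in the other.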

Random Sub-sampling

Random sub-sampling repeats the hold out method multiple times, with a different random split each time. If we sub-sample the data ‘n’ times, the accuracy of the model is calculated as the average of the accuracies of all the sub-sampled runs.

This method shares some of the drawbacks of the hold out method. In addition, it has no control over the number of times a particular record is used for testing and/or training.
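
The sketch below repeats a random hold out split several times and averages the accuracies. To stay self-contained it uses a deliberately trivial stand-in "classifier" that always predicts the most common training label; `majority_label`, `random_subsampling`, and the sample data are all hypothetical. It also counts how often each record was used for testing, which makes the lack of control visible.

```python
import random
from collections import Counter


def majority_label(train):
    """Stand-in 'classifier': always predicts the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def random_subsampling(records, n_runs=10, test_fraction=0.3, seed=0):
    """Repeat a random hold out split n_runs times and average the accuracy."""
    records = list(records)
    rng = random.Random(seed)
    n_test = max(1, round(len(records) * test_fraction))
    accuracies = []
    times_tested = Counter()  # record index -> number of runs it was tested in
    for _ in range(n_runs):
        indices = list(range(len(records)))
        rng.shuffle(indices)
        test = [records[i] for i in indices[:n_test]]
        train = [records[i] for i in indices[n_test:]]
        times_tested.update(indices[:n_test])
        predicted = majority_label(train)
        correct = sum(1 for _, label in test if label == predicted)
        accuracies.append(correct / len(test))
    return sum(accuracies) / n_runs, accuracies, times_tested


# Hypothetical (id, label) records: every third record is labeled "b".
data = [(i, "a" if i % 3 else "b") for i in range(30)]
mean_accuracy, per_run, times_tested = random_subsampling(data, n_runs=5)
print(mean_accuracy)
print(sorted(times_tested.values()))  # usually uneven across records
```

Records missing from `times_tested` were never used for testing in any run, while others may have been tested several times.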

Cross Validation

In this approach each record is used the same number of times for training and exactly once for testing. To illustrate this technique, consider splitting the data into two equal partitions. The first partition is used as the training set while the other is used as the test set. Next we swap the partitions, using the second as the training set and the first as the test set. In general, k-fold cross validation divides the data into k equal partitions. During each run one of the partitions is used for testing while the rest are used for training, and there are k such runs so that each partition is tested exactly once.

Cross validation works well only if the training set and the test set are drawn from the same population which represents all classes of data well.
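
The k-fold procedure described above can be sketched as a generator of (train, test) pairs. The name `k_fold_splits` is made up for this example. It assumes the records have already been shuffled, and when the data does not divide evenly by k the partitions differ in size by at most one record.

```python
def k_fold_splits(records, k):
    """Yield k (train, test) pairs; each record is tested exactly once."""
    records = list(records)
    if not 2 <= k <= len(records):
        raise ValueError("k must be between 2 and the number of records")
    # Deal the records into k partitions, like dealing cards to k players.
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test


# Ten hypothetical record ids split into five folds of two records each.
splits = list(k_fold_splits(range(10), k=5))
for train, test in splits:
    print(test, train)
```

With k = 2 this reduces to the two-partition swap used as the illustration, and across all k runs each record appears in k - 1 training sets.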


Bootstrap

In all the methods above the training data is sampled without replacement, so there are no duplicate records in the training and test sets. The bootstrap method goes a step further and selects training records with replacement, i.e. a record already chosen for training is put back into the original pool so that it is equally likely to be picked again. In every run the records that were never selected are used as the test set. The sampling procedure is repeated to produce multiple bootstrap samples.
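
A minimal sketch of drawing one bootstrap sample. It assumes the training sample has the same size as the original data set, which is a common choice but is not stated above; the function name `bootstrap_split` and the seed are also made up for the example.

```python
import random


def bootstrap_split(records, seed=0):
    """Draw len(records) training records with replacement.

    Records that are never drawn form the test set.
    """
    records = list(records)
    rng = random.Random(seed)
    n = len(records)
    chosen = [rng.randrange(n) for _ in range(n)]  # indices, repeats allowed
    chosen_set = set(chosen)
    train = [records[i] for i in chosen]
    test = [records[i] for i in range(n) if i not in chosen_set]
    return train, test


# One bootstrap sample over 1000 hypothetical record ids.
train, test = bootstrap_split(range(1000), seed=1)
print(len(train), len(set(train)), len(test))
```

Under that sample-size assumption, the chance that a given record is drawn at least once approaches 1 - 1/e (about 63.2%) for large data sets, so roughly a third of the records typically end up in the test set. Calling the function with different seeds produces the repeated bootstrap samples.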

Reference and Further Reading

Introduction to Data Mining (Pang-Ning Tan, Michael Steinbach and Vipin Kumar)