Are the results too good to be true? Common causes

Sandeep Ramachandra

Machine learning (ML) and its applications have exploded in the past decade, with seemingly every company wanting to incorporate “machine learning” or “artificial intelligence” into its product. With this comes increasing investment in machine learning research and publications. Machine learning has been used extensively in applications ranging from facial recognition in photos to automatic activity recognition on wearables to self-driving vehicles. Every paper claims to outdo the state of the art, with some models reporting nearly one hundred percent accuracy. Yet when they are tested in real-world scenarios, many of these results fall flat. So the question arises: can these results be believed?

What are the causes?

Machine learning is a function approximation approach to solving problems, i.e., ML algorithms try to approximate some unknown function by fitting a curve along multiple dimensions to the given data.

[Figure: classification example]
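As a toy illustration of this curve-fitting view (a minimal sketch; the data and model below are our own illustrative choices, not from any particular paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy two-class data: points drawn around two cluster centres.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# The model approximates the unknown function mapping points to labels
# by fitting a decision boundary (a curve in feature space) to the data.
model = LogisticRegression().fit(X, y)
print(model.predict([[0, 0], [3, 3]]))  # expected: [0 1]
```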

This means that the performance of an ML model depends on the training data used. Performance generally improves as the number of samples grows and as the samples cover the underlying data distribution more fully. If a model is well trained, it generalizes to unseen samples with similar performance. However, it can also happen that the model fits the training data perfectly but does not work on unseen data.

[Figure: classification overfit example]

Such an overfitted model scores 100% on the training samples only; by keeping separate training and testing samples, we can detect that a model has overfitted.

[Figure: train test split example]
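A minimal sketch of this check, assuming a scikit-learn workflow (the synthetic dataset and the decision tree are illustrative choices of ours): an unconstrained tree typically scores perfectly on its training split while doing noticeably worse on a held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise so perfect generalization is impossible.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# An unconstrained tree can memorize the training samples.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A large gap between these two scores is the telltale sign of overfitting.
print("train accuracy:", model.score(X_train, y_train))  # ~1.0
print("test accuracy:",  model.score(X_test, y_test))    # noticeably lower
```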

As previously mentioned, models tend to perform better with more samples. But some classes can have many samples while others have very few. For instance, if we were trying to find cancer patients in a hospital, non-cancer patients are far easier to find, and the data is likely to reflect this. The number of samples per class will be very imbalanced, which can lead to the model learning only the majority class, since it encounters it far more often.
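To see why plain accuracy misleads here, consider a small sketch (the 95/5 class ratio is our own illustrative choice): a baseline that always predicts “no cancer” already scores 95% accuracy while finding no patients at all.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 95% negative samples, 5% positive: mimics the cancer-screening example.
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((1000, 1))  # features are irrelevant for this baseline

# A baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print("accuracy:", accuracy_score(y, pred))              # 0.95, looks impressive
print("recall on cancer class:", recall_score(y, pred))  # 0.0, catches nobody
```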

Many papers therefore use techniques to balance the number of samples per class. The most trivial solution is to downsample the majority class (undersampling) to bring the class sizes closer together. Another is to increase the number of samples in the minority classes (oversampling). Each technique has its own benefits and drawbacks, and some algorithms combine oversampling and undersampling to bring the two techniques together.
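A minimal sketch of both techniques using plain random resampling via scikit-learn's `resample` (dedicated libraries such as imbalanced-learn offer more sophisticated variants):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_major, X_minor = rng.normal(size=(950, 2)), rng.normal(size=(50, 2))

# Undersampling: shrink the majority class down to the minority size.
X_major_down = resample(X_major, n_samples=len(X_minor),
                        replace=False, random_state=0)

# Oversampling: repeat minority samples (with replacement) up to the majority size.
X_minor_up = resample(X_minor, n_samples=len(X_major),
                      replace=True, random_state=0)

print(len(X_major_down), len(X_minor))  # 50 50   (balanced by undersampling)
print(len(X_major), len(X_minor_up))    # 950 950 (balanced by oversampling)
```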

One common pitfall with these class balancing techniques is that information often leaks from the training samples to the testing samples, which can inflate the results without the model generalizing to real-world applications. We discuss two ways this can happen:

  1. Imputation: Imputation refers to replacing missing or Not a Number (NaN) feature values in the samples with a statistical indicator of that feature (for example, its mean) derived from the existing values. In many cases, imputation is performed on the whole dataset, so the test samples share information with the training samples and vice versa. While this may seem like a small issue to a human, ML techniques have long been able to correlate and extract signal from such ‘insignificant’ information (see the first sketch after this list).
  2. Splitting after class balancing: Datasets are split by shuffling all available samples and assigning a percentage of them to each split. If the original dataset has a 75%/25% class split, each split is likely to show roughly the same proportions. If we apply class balancing, the percentages shift closer to 50%/50% (the ideal case), which is reflected in the splits as well. But if we balance the classes before splitting, oversampling creates duplicate copies of samples, and those copies can land in both the training and the testing split. The model then sees testing samples during training, which gives exaggerated testing results (see the second sketch after this list).
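For point 1 above, here is a minimal sketch of the leaky pattern versus the safe one, assuming scikit-learn's `SimpleImputer`: the safe version fits the imputer on the training split only and applies it to both splits.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.1] = np.nan  # knock out ~10% of values

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Leaky: the fill values (column means) are computed from ALL samples,
# so statistics of the test set bleed into the training data.
leaky = SimpleImputer(strategy="mean").fit(X)

# Safe: fit on the training split only, then apply to both splits.
imputer = SimpleImputer(strategy="mean").fit(X_train)
X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)
```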
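For point 2, a small sketch that exposes the leak directly: oversampling before the split copies some samples into both splits, which cannot happen if we split first.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_major, X_minor = rng.normal(size=(75, 2)), rng.normal(size=(25, 2))

# Wrong order: oversample the minority class first, then split.
X_minor_up = resample(X_minor, n_samples=75, replace=True, random_state=0)
X_all = np.vstack([X_major, X_minor_up])
X_train, X_test = train_test_split(X_all, test_size=0.25, random_state=0)

# Count test rows that also appear verbatim in the training split.
train_rows = {row.tobytes() for row in X_train}
dupes = sum(row.tobytes() in train_rows for row in X_test)
print(f"{dupes} of {len(X_test)} test samples were seen during training")
```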

Ideal usage for class balancing

Algorithm:
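Split first; fit any imputation on the training split only; balance only the training split; evaluate on the untouched test split. A minimal end-to-end sketch of this ordering (the dataset, imputer, and classifier are illustrative choices of ours):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Imbalanced synthetic data (~90%/10%) with some missing values.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.05] = np.nan

# 1. Split FIRST, so nothing computed later can leak across the boundary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 2. Fit imputation on the training split only; apply it to both.
imputer = SimpleImputer(strategy="mean").fit(X_train)
X_train, X_test = imputer.transform(X_train), imputer.transform(X_test)

# 3. Balance the TRAINING split only by oversampling the minority class.
minority = y_train == 1
X_min_up, y_min_up = resample(X_train[minority], y_train[minority],
                              n_samples=(~minority).sum(), replace=True,
                              random_state=0)
X_bal = np.vstack([X_train[~minority], X_min_up])
y_bal = np.concatenate([y_train[~minority], y_min_up])

# 4. Train on balanced data, evaluate on the untouched test split.
model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print("test accuracy:", model.score(X_test, y_test))
```

Note that the test split keeps the original, imbalanced class ratio, so the reported score reflects the conditions the model will face in the real world.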

Conclusion

We discussed only a couple of the possible pitfalls in training ML models. With this knowledge, we hope you will be able to identify when a result seems too good to be true, and to avoid these mistakes in your own models as well. This blog post was inspired by a paper published in Scientific Reports that made these mistakes; we have written an in-depth exploration of the mistakes made and the results after they are corrected.