Machine learning (ML) and its applications have exploded in the past decade, with every company wanting to incorporate “machine learning” or “artificial intelligence” into their product. With this comes increasing investment in machine learning research and publications. Machine learning has been used extensively in applications ranging from facial recognition in photos to automatic activity recognition in wearables to self-driving vehicles. Every paper claims to outdo the state-of-the-art results, with some models reaching nearly one hundred percent accuracy. Yet when they are tested in real-world scenarios, many of these results fall flat. So the question arises: can these results be believed?
Machine learning is a function approximation approach to solving problems: ML algorithms try to approximate some unknown function by fitting a curve, along multiple dimensions, to the given data.
This means that the performance of an ML model depends on the training data used. Performance generally improves with more samples and a broader distribution of data. If a model is well trained, it generalizes to unseen samples with similar performance. However, it can also happen that the model fits the training data perfectly yet fails to work on unseen data.
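To make the curve-fitting picture concrete, here is a minimal sketch (our own illustration, not from the paper we discuss later) in which the “unknown” function is a sine wave: we draw noisy samples from it, fit a polynomial to those samples, and then check how close the fitted curve stays to the true function at points it never saw.

```python
# A minimal sketch of ML as function approximation. The "unknown"
# function, the polynomial degree and the sample sizes are all
# illustrative choices, not anything prescribed by the post.
import numpy as np

rng = np.random.default_rng(0)

def unknown_function(x):
    return np.sin(x)

# Noisy training samples drawn from the unknown function.
x_train = rng.uniform(0, 2 * np.pi, size=30)
y_train = unknown_function(x_train) + rng.normal(scale=0.1, size=x_train.shape)

# Approximate the function by fitting a degree-5 polynomial curve.
model = np.poly1d(np.polyfit(x_train, y_train, deg=5))

# Evaluate on unseen points: a well-trained model should stay close
# to the true function between the training samples.
x_test = np.linspace(0, 2 * np.pi, 100)
error = np.mean((model(x_test) - unknown_function(x_test)) ** 2)
print(f"Mean squared error on unseen points: {error:.4f}")
```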
This is overfitting: the model scores close to 100% on the training samples alone, and by keeping separate training and testing samples we can detect that a model has overfitted.
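The sketch below shows how a held-out test set exposes overfitting. An unconstrained decision tree is free to memorise noisy training labels, so it reaches near-perfect training accuracy while scoring noticeably lower on data it has never seen. The dataset and model here are our own choices, picked only because they overfit easily.

```python
# Hedged sketch: detecting overfitting with a train/test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y) so memorisation hurts.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# No depth limit: the tree is free to memorise the training set.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

print("Train accuracy:", model.score(X_train, y_train))  # close to 1.0
print("Test accuracy: ", model.score(X_test, y_test))    # noticeably lower
```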
As previously mentioned, models generally perform better with more samples. However, some classes may have many samples while others have very few. For instance, if we were trying to identify cancer patients in a hospital, non-cancer patients are far easier to find, and the data is likely to reflect this: the number of samples per class will be heavily imbalanced. This can lead to the model learning only the majority class, since it encounters it much more often.
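A small illustration of why imbalance misleads plain accuracy (the numbers below are hypothetical, chosen by us): with 95 healthy patients and 5 cancer patients, a “model” that always predicts “healthy” already scores 95% accuracy while detecting zero cancer cases.

```python
# Hedged sketch: a majority-class baseline on an imbalanced dataset.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([0] * 95 + [1] * 5)   # 0 = healthy, 1 = cancer (hypothetical counts)
X = np.zeros((100, 1))             # features are irrelevant for this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))               # 0.95
print("Recall on the cancer class:", recall_score(y, y_pred))  # 0.0
```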
Many papers therefore utilise techniques to balance the number of samples for each class. The most trivial solution is to downsample the majority class (undersampling) to bring the class sizes closer together. Another solution is to increase the number of minority-class samples (oversampling). Each technique has its own benefits and drawbacks, and other algorithms combine oversampling and undersampling to bring the two together.
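Here is a minimal sketch of random over- and undersampling using sklearn.utils.resample; dedicated libraries such as imbalanced-learn offer more sophisticated methods (e.g. SMOTE), but the underlying idea is the same. The class sizes are illustrative.

```python
# Hedged sketch: random oversampling and undersampling by resampling rows.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)   # 90 majority vs 10 minority samples

X_min, X_maj = X[y == 1], X[y == 0]

# Oversampling: duplicate minority samples (with replacement) up to the majority size.
X_min_over = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

# Undersampling: keep only a random subset of majority samples, matching the minority size.
X_maj_under = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)

print("After oversampling: ", len(X_min_over), "minority vs", len(X_maj), "majority")
print("After undersampling:", len(X_min), "minority vs", len(X_maj_under), "majority")
```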
One common pitfall with these class-balancing techniques is that information often leaks from the training samples into the testing samples, which inflates the results without generalizing to real-world applications. We discuss two cases of how this can happen:
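As a concrete illustration of this kind of leakage (our own sketch, not the example from the paper), consider random oversampling by duplication applied before the train/test split. Exact copies of the same minority sample can then land in both the training and the test set, so part of the “unseen” test data has in fact been seen during training. The safe order is to split first and resample only the training set.

```python
# Hedged sketch: counting how many test samples leak from the training set
# when oversampling is done BEFORE the train/test split. Sizes are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# The pitfall: oversample the minority class on the FULL dataset...
X_min, y_min = X[y == 1], y[y == 1]
X_min_over, y_min_over = resample(X_min, y_min, replace=True,
                                  n_samples=int((y == 0).sum()), random_state=0)
X_bal = np.vstack([X, X_min_over])
y_bal = np.concatenate([y, y_min_over])

# ...and only then split into training and testing sets.
X_train, X_test, _, _ = train_test_split(X_bal, y_bal, test_size=0.3, random_state=0)

# Count test rows that are exact duplicates of training rows: these are
# samples the model will already have "seen" when it is evaluated.
train_rows = {row.tobytes() for row in X_train}
leaked = sum(row.tobytes() in train_rows for row in X_test)
print(f"{leaked} of {len(X_test)} test samples also appear in the training set")
```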
We have discussed only a couple of possible pitfalls in training ML models. With this knowledge, we hope you will be able to identify when a result seems too good to be true, and hopefully avoid these mistakes in your own models as well. This blog post was inspired by a paper published in Scientific Reports that made these mistakes. We have written an in-depth exploration of the mistakes made and of the results after they have been corrected.