ML - Dataset Oversampling

Ifrim Ciprian
Mar 17, 2022
Updated: Apr 7, 2022

As stated in the last post, the dataset is not balanced between classes. That is normal, as rain is more common than snow, but many estimators work best when the classes are roughly balanced.

Here is the actual distribution:
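A distribution like this can be computed with a few lines of Python. The sketch below is a minimal example: the label counts are made-up placeholders, not the values from the real dataset, and `class_distribution` is just an illustrative helper name.

```python
from collections import Counter

# Placeholder labels standing in for the real weather dataset (assumption).
labels = ["Rain"] * 6 + ["Fair"] * 3 + ["Snow"] * 1


def class_distribution(labels):
    """Return {class: (count, share of total)}, most frequent class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in counts.most_common()}


distribution = class_distribution(labels)
for cls, (n, share) in distribution.items():
    print(f"{cls:<5} {n:>3} {share:.0%}")
```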

A common technique for balancing the classes is oversampling. The two most widely used oversampling methods are SMOTE and ADASYN:

SMOTE stands for Synthetic Minority Over-sampling Technique. It performs the same basic task as simple resampling (creating new data points for the minority class), but instead of duplicating observations, it creates new observations along the line segments between a randomly chosen minority point and its nearest minority-class neighbors.
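The interpolation idea can be sketched in plain Python. This is a simplified illustration, not the library implementation: the function name `smote_sketch`, the value of `k`, and the toy points are all assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of `base` inside the minority class (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy 2-D minority samples (placeholder values, e.g. two scaled features).
snow = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(snow, n_new=5)
print(new_points)
```

Every synthetic point lies on a segment between two existing minority points, which is why the oversampled class tends to show line-like patterns in a scatter plot.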

ADASYN (Adaptive Synthetic Sampling) is a variant of SMOTE. It creates new points by the same kind of interpolation, but adaptively: minority points whose neighborhoods contain more samples from other classes, and which are therefore harder to learn, receive more synthetic samples.
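To make the adaptive part of ADASYN concrete, here is a minimal pure-Python sketch, not the library implementation. Following the usual ADASYN formulation, each minority point is weighted by the share of other-class points among its k nearest neighbors, and the synthetic samples are split in proportion to that weight. The function name `adasyn_allocation` and the toy data are made up for illustration.

```python
import math


def adasyn_allocation(X, y, minority_label, n_new, k=3):
    """Split n_new synthetic samples across the minority points, in proportion
    to the share of other-class points among each one's k nearest neighbors."""
    points = list(zip(X, y))
    ratios = []
    for i, (xi, yi) in enumerate(points):
        if yi != minority_label:
            continue
        neighbors = sorted(
            (p for j, p in enumerate(points) if j != i),
            key=lambda p: math.dist(xi, p[0]),
        )[:k]
        other_class = sum(label != minority_label for _, label in neighbors)
        ratios.append((i, other_class / k))
    total = sum(r for _, r in ratios)
    if total == 0:
        raise ValueError("no minority point has other-class neighbors")
    return {i: round(n_new * r / total) for i, r in ratios}


# Toy data: three Snow points far from Rain, one Snow point surrounded by Rain.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0),
     (5.1, 5.0), (5.0, 5.1), (4.9, 5.0), (5.0, 4.9)]
y = ["Snow", "Snow", "Snow", "Snow", "Rain", "Rain", "Rain", "Rain"]

allocation = adasyn_allocation(X, y, minority_label="Snow", n_new=10, k=2)
print(allocation)
```

In this toy case all ten new samples are assigned to the Snow point at index 3, the only one whose neighbors are Rain. Plain SMOTE would instead pick base points uniformly at random.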

The following figure compares scatter plots of the original data against the two oversampling methods. Note how the SNOW class, in orange, has grown considerably and shows patterns produced by the k-nearest-neighbor interpolation.

After testing all the estimators, the per-class performance with the original 20-year dataset (which was the best), and with the dataset oversampled by each method, looks as follows:

As expected, with oversampling the classifier became better at identifying the under-represented classes, but its performance on the Rain and Fair classes was lower.
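Per-class performance can be measured in several ways, and the post does not say which metric was used. As one hedged example, per-class recall (the fraction of each true class that was predicted correctly) can be computed like this. The labels and predictions are invented purely to show the calculation.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that the classifier labelled correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Invented labels and predictions, only to illustrate the metric.
y_true = ["Rain", "Rain", "Rain", "Rain", "Fair", "Fair", "Snow", "Snow"]
y_pred = ["Rain", "Rain", "Rain", "Fair", "Fair", "Fair", "Snow", "Rain"]

recall = per_class_recall(y_true, y_pred)
print(recall)
```

Comparing these numbers for a model trained on the original data and one trained on oversampled data shows which classes gain and which lose.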

In my case, since the Rain and Fair classes are the most important, I will use the original dataset, which also reflects how often these events actually occur.

The GitHub repository is here: https://github.com/CiprianFlorin-Ifrim/Dataset_Oversampling