Using machine learning estimators, I can compare the data from my sensors with historical data in order to estimate how much rainfall there will be.
For the datasets, I have used two websites: https://www.visualcrossing.com/weather/weather-data-services# and https://www.wunderground.com/history/daily/EGLC/date/2022-3-17
I have tested multiple datasets as follows:
train dataset: 20 years (2000-2019), daily averages, 5 features / test dataset: 2 years (2020-2021)
train dataset: 2 years (2021-2022), daily averages, 3 features / test dataset: 1 year (2019)
train dataset: 2 years (2021-2022), daily averages, 6 features / test dataset: 1 year (2019)
train dataset: 2 years (2021-2022), hourly averages, 6 features / test dataset: 1 year (2019)
train dataset: 4 years (2018-2021), daily averages, 6 features / test dataset: 3 months (Jan-March 2022)
train dataset: 4 years (2018-2021), hourly averages, 6 features / test dataset: 3 months (Jan-March 2022)
The features are as follows:
3 features: temp, humidity, pressure
5 features: temp, feels like temp, dew point, humidity, pressure
6 features: temp, feels like temp, dew point, humidity, pressure, uv index
And for the Regression estimators, I have used:
KNN Regression
Decision Tree Regression
Random Forest Regression
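To make the comparison concrete, here is a minimal sketch of fitting these three estimators side by side. It assumes scikit-learn, uses default hyperparameters, and runs on synthetic stand-in data (the column meanings and the rainfall formula are made up for illustration), so it is not the code from the repository.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the 3-feature dataset: temp, humidity, pressure.
X = rng.normal(loc=[12.0, 75.0, 1013.0], scale=[6.0, 12.0, 9.0], size=(400, 3))

# Hypothetical rainfall target (illustration only, not a weather model).
noise = rng.normal(0.0, 0.5, size=400)
y = np.clip(0.1 * (X[:, 1] - 70.0) - 0.05 * (X[:, 2] - 1013.0) + noise, 0.0, None)

# Simple split: first 300 rows for training, last 100 for testing.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

models = {
    "KNN": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mae": mean_absolute_error(y_test, pred),
        "mse": mean_squared_error(y_test, pred),
    }
    print(f"{name}: MAE={results[name]['mae']:.3f}  MSE={results[name]['mse']:.3f}")
```

With real data, X would hold the chosen feature columns and y the recorded precipitation.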
I have created a GitHub Repo here: https://github.com/CiprianFlorin-Ifrim/ML_Precipitation_Regression
The code outputs the actual predictions as a CSV file, the performance metrics as a txt file, and the mean squared and mean absolute errors as PNG plots.

The plots look as follows:


Here are my findings:
For the 3 features dataset:

Metrics:

Here are some of the plotted histograms:

The same process was applied to all datasets, and the findings can be found in the 4 Excel files that have been added at the bottom of the blog.
The estimators have been tested both with their default settings and after optimisation with grid search or random search.
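As a rough illustration of that optimisation step, the sketch below tunes a decision tree regressor with both an exhaustive grid search and a random search. It assumes scikit-learn; the parameter grid, the scoring choice and the synthetic data are assumptions for the example, not the settings used in the repository.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic stand-in for a 6-feature dataset (values are made up).
X = rng.uniform(0.0, 1.0, size=(300, 6))
y = 5.0 * X[:, 3] - 3.0 * X[:, 4] + rng.normal(0.0, 0.2, size=300)

# Hypothetical search space: tree depth and minimum leaf size.
param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 20]}

# Grid search tries all 12 combinations with 3-fold cross-validation.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
grid.fit(X, y)

# Random search samples only 5 of the 12 combinations.
rand = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=5,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=0,
)
rand.fit(X, y)

# Scores are negated MAE, so flip the sign to report the error.
print("grid search:  ", grid.best_params_, -grid.best_score_)
print("random search:", rand.best_params_, -rand.best_score_)
```

Random search is cheaper but can miss the best combination; grid search is exhaustive over the listed values.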
The best estimator is a default Decision Tree Regressor, grown to its full depth so that the leaves are pure.
When simplified to a depth of 4, it looks as follows:

It reaches an R² error of only 10.34% and a mean absolute deviation of 9.29%. In statistics, the coefficient of determination, denoted R² or r² and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).
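For reference, here is a minimal plain-Python sketch of how the coefficient of determination, the mean absolute error and the mean squared error are commonly computed. The function names and the sample values are made up for illustration; the repository may compute its metrics differently (for example through a library).

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: share of the variance the model explains."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values, only to exercise the functions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("R^2:", r_squared(y_true, y_pred))
```

An R² of 1 means the predictions match the true values exactly, while 0 means the model does no better than always predicting the mean.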

Here is a histogram of true values vs predicted values on the 2 years test dataset:

Therefore, based on these metrics, we can conclude that the model is highly accurate on the test data.
So now, thanks to the micromlgen library, we can port this model to the Arduino: https://github.com/eloquentarduino/micromlgen
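To give a feel for what porting a tree to a microcontroller involves, here is a hand-written sketch that turns a fitted scikit-learn decision tree regressor into a C function made of nested if/else statements. This is not micromlgen's API or its output: the helper `tree_to_c`, the function name `predict_rainfall`, the feature names and the training data are all hypothetical, and the thresholds are rounded to 6 decimals, so results near a split boundary may differ slightly from scikit-learn's.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def tree_to_c(model, feature_names, func_name="predict_rainfall"):
    """Emit C source with one if/else per split and one return per leaf."""
    tree = model.tree_
    lines = [f"float {func_name}(const float *x) {{"]

    def walk(node, depth):
        pad = "    " * depth
        if tree.children_left[node] == -1:  # -1 marks a leaf in scikit-learn
            lines.append(f"{pad}return {tree.value[node][0][0]:.6f}f;")
            return
        idx = tree.feature[node]
        # scikit-learn sends a sample left when feature <= threshold.
        lines.append(
            f"{pad}if (x[{idx}] <= {tree.threshold[node]:.6f}f) {{"
            f"  /* {feature_names[idx]} */"
        )
        walk(tree.children_left[node], depth + 1)
        lines.append(f"{pad}}} else {{")
        walk(tree.children_right[node], depth + 1)
        lines.append(f"{pad}}}")

    walk(0, 1)
    lines.append("}")
    return "\n".join(lines)


# Made-up training data: temp, humidity, pressure -> rainfall.
rng = np.random.default_rng(2)
X = rng.normal(loc=[12.0, 75.0, 1013.0], scale=[6.0, 12.0, 9.0], size=(200, 3))
y = np.clip(0.1 * (X[:, 1] - 70.0) + rng.normal(0.0, 0.5, size=200), 0.0, None)

# A shallow tree keeps the generated code short enough to read.
model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
code = tree_to_c(model, ["temp", "humidity", "pressure"])
print(code)
```

The printed function could be pasted into an Arduino sketch and called with an array of sensor readings in the same feature order used for training. A full-depth tree produces far more code, so flash size is worth checking.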