In this project we received two datasets with no accompanying information, one for a regression task and one for a classification task. The main goal was to apply basic machine learning techniques, such as exploratory data analysis, feature engineering and k-fold cross-validation, and to fit some linear models in preparation for the second project.
Exploratory data analysis & feature processing
Learning about the characteristics of the data is very important: we can see correlations among features, check visually whether the data are linearly separable, identify outliers and remove them from the training set, estimate the data distribution, and so on. Moreover, we can start thinking about which transformations might represent our data better.
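As an illustration of the checks described above, here is a minimal sketch in Python with NumPy. The synthetic data, the helper names (correlation_matrix, remove_outliers) and the 3-standard-deviation threshold are assumptions made for the example, not the procedure actually used in the project.

```python
import numpy as np

def correlation_matrix(X):
    """Pearson correlation between the columns (features) of X."""
    return np.corrcoef(X, rowvar=False)

def remove_outliers(X, y, z_max=3.0):
    """Drop rows where any feature lies more than z_max standard
    deviations from its column mean (a simple, assumed criterion)."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant features never flag a row
    z = np.abs((X - X.mean(axis=0)) / std)
    keep = (z <= z_max).all(axis=1)
    return X[keep], y[keep]

# Synthetic data standing in for the real (unknown) dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)  # feature 2 tracks feature 0
y = X @ np.array([1.0, -2.0, 0.5])
X[0, 1] = 50.0  # inject one obvious outlier

corr = correlation_matrix(X)
X_clean, y_clean = remove_outliers(X, y)
print("correlation between features 0 and 2:", corr[0, 2])
print("rows kept:", X_clean.shape[0], "of", X.shape[0])
```

A strongly correlated pair of features, such as features 0 and 2 here, suggests that one of them carries little additional information for a linear model.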
To obtain the lowest root mean square error (RMSE) for regression, or the lowest 0-1 loss for classification, with a given model, we have to use a validation set, which allows us to find the hyperparameters that give the lowest error for that model. For example, a ridge regression with a polynomial basis has two hyperparameters: the regularization parameter and the degree of the polynomial basis. We need to be especially careful about underfitting and overfitting.
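The ridge regression example can be sketched as a small grid search over the degree and the regularization parameter, scored by RMSE on a validation set. This is a hedged sketch on synthetic one-dimensional data: the function names, the candidate grids and the data are assumptions for illustration, and for simplicity the penalty is also applied to the constant term.

```python
import numpy as np

def poly_basis(x, degree):
    """Map a 1-D input to the columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution: solve (Phi^T Phi + lam I) w = Phi^T y."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def grid_search(x_tr, y_tr, x_va, y_va, degrees, lambdas):
    """Return (validation RMSE, degree, lambda) of the best combination."""
    best = (np.inf, None, None)
    for d in degrees:
        Phi_tr, Phi_va = poly_basis(x_tr, d), poly_basis(x_va, d)
        for lam in lambdas:
            w = ridge_fit(Phi_tr, y_tr, lam)
            err = rmse(y_va, Phi_va @ w)
            if err < best[0]:
                best = (err, d, lam)
    return best

# Synthetic cubic data (an assumption, not the project's dataset).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 1 - 2 * x + 3 * x**3 + 0.1 * rng.normal(size=100)
x_tr, y_tr, x_va, y_va = x[:60], y[:60], x[60:], y[60:]

degrees = range(1, 7)
lambdas = np.logspace(-6, 1, 8)
best_err, best_degree, best_lam = grid_search(x_tr, y_tr, x_va, y_va, degrees, lambdas)
print("best degree:", best_degree, "best lambda:", best_lam, "validation RMSE:", best_err)
```

A degree that is too low underfits the cubic trend, while a high degree with a very small regularization parameter tends to overfit the training points; the validation error exposes both cases.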
All the models used in this project were linear, possibly applied after a non-linear transformation of the data. To compare our tuned models, we need a test set that gives an unbiased estimate of the generalization error. One possible way to do that is to hold out part of the data, e.g. 60% of the data for training, 20% for validation and 20% for testing; with k-fold cross-validation, the validation part is instead rotated across k folds of the non-test data.
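The 60/20/20 split and a k-fold scheme can both be expressed as index sets over the samples. The sketch below assumes shuffled, equally weighted samples; the function names and the fixed seed are illustrative choices rather than the project's actual code.

```python
import numpy as np

def train_val_test_split(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the indices 0..n-1 and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_tr = int(round(fractions[0] * n))
    n_va = int(round(fractions[1] * n))
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is used for
    validation exactly once across the k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

tr, va, te = train_val_test_split(100)
print(len(tr), len(va), len(te))

folds = list(k_fold_indices(80, k=4))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

In a k-fold setting, the validation error of a hyperparameter choice would be averaged over the folds, and the held-out test indices would be touched only once, for the final estimate.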