KFold Cross Validation does not fix overfitting


I separate the features into X and y, then preprocess the train and test data after splitting them with k-fold cross-validation. After that I fit the training data to my Random Forest Regressor model and calculate the confidence score. Why do I preprocess after splitting? Because people tell me it is more correct to do it that way, and I am keeping to that principle for the sake of my model's performance.

This is my first time using KFold cross-validation. My model overfits and I thought I could fix it with cross-validation. I am still confused about how to use it: I have read the documentation and some articles, but I do not really understand how to apply it to my model. I tried anyway, and my model still overfits. Whether I use train test split or cross-validation, my model score is still 0.999. I do not know what my mistake is since I am very new to this method, but I think I may have done it wrong, so it does not fix the overfitting. Please tell me what is wrong with my code and how to fix it.

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss

avo_sales = pd.read_csv('avocados.csv')

avo_sales.rename(columns = {'4046':'small PLU sold',
                            '4225':'large PLU sold',
                            '4770':'xlarge PLU sold'},
                 inplace= True)

avo_sales.columns = avo_sales.columns.str.replace(' ','')

x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'],1))
y = np.array(avo_sales.TotalBags)

# X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

kf = KFold(n_splits=10)

for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]

    impC = SimpleImputer(strategy='most_frequent')
    X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
    X_test[:,8] = impC.transform(X_test[:,8].reshape(-1,1)).ravel()

    imp = SimpleImputer(strategy='median')
    X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
    X_test[:,1:8] = imp.transform(X_test[:,1:8])

    le = LabelEncoder()
    X_train[:,8] = le.fit_transform(X_train[:,8])
    X_test[:,8] = le.transform(X_test[:,8])

    rfr = RandomForestRegressor()
    rfr.fit(X_train, y_train)
    confidence = rfr.score(X_test, y_test)
    print(confidence)

The reason you're overfitting is that a non-regularized tree-based model will keep adjusting to the data until all training samples are predicted correctly. See for example this image:

As you can see, this does not generalize well. If you don't specify arguments that regularize the trees, the model will fit the test data poorly because it will basically just learn the noise in the training data. There are many ways to regularize trees in sklearn; you can find them here. For instance:

- max_features
- min_samples_leaf
- max_depth

With proper regularization, you can get a model that generalizes well to the test data. Look at a regularized model for instance:

To regularize your model, instantiate the RandomForestRegressor class like this:

rfr = RandomForestRegressor(max_features=0.5, min_samples_leaf=4, max_depth=6)

These argument values are arbitrary; it's up to you to find the ones that fit your data best. You can use domain-specific knowledge to choose these values, or a hyperparameter tuning search like GridSearchCV or RandomizedSearchCV.
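As a rough sketch of what such a search could look like, the snippet below runs RandomizedSearchCV over the three regularization arguments mentioned above. The synthetic data from make_regression and the candidate values are assumptions for illustration only; you would substitute your own X and y and a grid that makes sense for your data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: replace with your own X and y).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Candidate values are illustrative, not recommendations.
param_distributions = {
    "max_features": [0.3, 0.5, 0.8, 1.0],
    "min_samples_leaf": [1, 2, 4, 8],
    "max_depth": [4, 6, 10, None],
}

# Try 8 random combinations, scoring each one with 5-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_distributions=param_distributions,
    n_iter=8,
    cv=5,
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated R^2 of the best combination
print(best_params)
print(round(best_score, 3))
```

After fitting, search.best_estimator_ holds a model refit on all of the data with the best combination found, so it can be used directly for prediction.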

Other than that, imputing the median and the most frequent value might introduce a lot of noise into your data. I would advise against it unless you have no other choice.


While @NicolasGervais's answer gets to the bottom of why your specific model is overfitting, I think there is a conceptual misunderstanding about cross-validation in the original question. You seem to think that:


Cross-validation is a method that improves the performance of a machine learning model.


But this is <em>not</em> the case.

Cross validation is a method that is used to estimate the performance of a given model on unseen data. By itself, it cannot improve the accuracy. In other words, the respective scores can tell you if your model is overfitting the training data, but simply applying cross-validation does not make your model better.
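To make this concrete, here is a minimal sketch using cross_validate with return_train_score=True, which reports the training score and the held-out score for every fold. The synthetic data and the model settings are assumptions for illustration; the point is only that cross-validation reports the gap between the two scores and does not change the model.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Noisy synthetic data (assumption: stands in for a real dataset).
X, y = make_regression(n_samples=200, n_features=8, noise=25.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X,
    y,
    cv=cv,
    return_train_score=True,  # also score each fold's model on its own training data
)

train_scores = results["train_score"]
test_scores = results["test_score"]
print("train R^2 per fold:", train_scores.round(3))
print("test R^2 per fold: ", test_scores.round(3))
```

A training score that is consistently much higher than the test score is the symptom of overfitting. Fixing it still requires changing the model or the data, for example through regularization.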

Example: Let's look at a dataset with 10 points, and fit a line through it:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = np.random.randint(0,10,10)
Y = np.random.randint(0,10,10)

fig = plt.figure(figsize=(1,10))

def line(x, slope, intercept):
    return slope * x + intercept

for i in range(5):

    # note that this is not technically 5-fold cross-validation
    # because I allow the same datapoint to go into the test set
    # several times. For illustrative purposes it is fine imho.
    # replace=False keeps the two test points of one split distinct.
    test_indices = np.random.choice(np.arange(10), 2, replace=False)
    train_indices = list(set(range(10))-set(test_indices))

    # get train and test sets
    X_train, Y_train = X[train_indices], Y[train_indices]
    X_test, Y_test = X[test_indices], Y[test_indices]

    # training set has one feature and multiple entries
    # so, reshape(-1,1)
    X_train, Y_train, X_test, Y_test = X_train.reshape(-1,1), Y_train.reshape(-1,1), X_test.reshape(-1,1), Y_test.reshape(-1,1)

    # fit and evaluate linear regression
    reg = LinearRegression().fit(X_train, Y_train)
    score_train = reg.score(X_train, Y_train)
    score_test = reg.score(X_test, Y_test)

    # extract coefficients from model:
    slope, intercept = reg.coef_[0], reg.intercept_[0]

    print(score_test)

    # show train and test sets
    plt.subplot(5,1,i+1)
    plt.scatter(X_train, Y_train, c='k')
    plt.scatter(X_test, Y_test, c='r')

    # draw regression line
    plt.plot(np.arange(10), line(np.arange(10), slope, intercept))
    plt.ylim(0,10)
    plt.xlim(0,10)

    plt.title('train: {:.2f} test: {:.2f}'.format(score_train, score_test))

You can see that the scores on training and test set are vastly different. You can also see that the estimated parameters vary <em>a lot</em> with the change of train and test set.

That does not make your linear model any better at all. But now you know exactly how bad it is :)


