
NaNs suddenly appearing for sklearn KFolds

Question:

I'm trying to run cross validation on my data set. The data appears to be clean, but then when I try to run it, some of my data gets replaced by NaNs. I'm not sure why. Has anybody seen this before?

y, X = np.ravel(df_test['labels']), df_test[['variation', 'length', 'tempo']]
X_train, X_test, y_train, y_test = cv.train_test_split(X, y, test_size=.30, random_state=4444)

This is what my X data looked like before KFolds:

   variation       length       tempo
0   0.005144  1183.148118  135.999178
1   0.002595   720.165442  117.453835
2   0.008146   397.500952  112.347147
3   0.005367  1109.819501  172.265625
4   0.001631   509.931973  135.999178
5   0.001620   560.365714  151.999081
6   0.002513   763.377778  107.666016
7   0.009262   502.083628   99.384014
8   0.000610   500.017052  143.554688
9   0.000733   269.001723  117.453835

My Y data looks like this:

array([ True, False, False,  True,  True,  True,  True, False,  True, False], dtype=bool)

Now when I try to do the cross val:

kf = KFold(X_train.shape[0], n_folds=4, shuffle=True)
for train_index, val_index in kf:
    cv_train_x = X_train.ix[train_index]
    cv_val_x = X_train.ix[val_index]
    cv_train_y = y_train[train_index]
    cv_val_y = y_train[val_index]
    print cv_train_x
    logreg = LogisticRegression(C=.01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    print accuracy_score(cv_val_y, pred)

When I try to run this, it fails with the error below, so I added the print statement to inspect the data.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

This is what the print statement showed; some of the data has become NaNs:

   variation       length       tempo
0        NaN          NaN         NaN
1        NaN          NaN         NaN
2   0.008146   397.500952  112.347147
3   0.005367  1109.819501  172.265625
4   0.001631   509.931973  135.999178

I'm sure I'm doing something wrong. Any ideas? As always, thank you so much!

Answer1:

To solve this, use .iloc instead of .ix to index your pandas DataFrame:

for train_index, val_index in kf:
    cv_train_x = X_train.iloc[train_index]
    cv_val_x = X_train.iloc[val_index]
    cv_train_y = y_train[train_index]
    cv_val_y = y_train[val_index]
    print cv_train_x
    logreg = LogisticRegression(C=.01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    print accuracy_score(cv_val_y, pred)
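For reference, the KFold constructor used above comes from the old sklearn.cross_validation module, which has since been removed. A minimal sketch of the same loop with the modern sklearn.model_selection API (assuming scikit-learn 0.18 or later and Python 3 print functions):

from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# KFold now takes n_splits and exposes a .split() method
# instead of being iterable itself.
kf = KFold(n_splits=4, shuffle=True, random_state=4444)
for train_index, val_index in kf.split(X_train):
    # .iloc indexes by position, so the integer indices produced by
    # KFold always line up with rows regardless of the DataFrame's labels.
    cv_train_x = X_train.iloc[train_index]
    cv_val_x = X_train.iloc[val_index]
    cv_train_y = y_train[train_index]
    cv_val_y = y_train[val_index]

    logreg = LogisticRegression(C=.01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    print(accuracy_score(cv_val_y, pred))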

Indexing with .ix is usually equivalent to using .loc, which is label-based indexing, not position-based. While .loc works fine on X, which has a clean integer index labeled 0 through 9, after the train/test split (which shuffles the rows) that no longer holds, and X_train looks something like:

        length       tempo  variation
4   509.931973  135.999178   0.001631
2   397.500952  112.347147   0.008146
7   502.083628   99.384014   0.009262
6   763.377778  107.666016   0.002513
5   560.365714  151.999081   0.001620
3  1109.819501  172.265625   0.005367
9   269.001723  117.453835   0.000733

and now you no longer have label 0 or 1, so if you do

X_train.loc[1]

you will get an exception:

KeyError: 'the label [1] is not in the [index]'

However, pandas has a silent failure mode if you request multiple labels and at least one of them exists: the missing labels come back as rows of NaNs. Thus if you do

X_train.loc[[1,4]]

you will get

       length       tempo  variation
1         NaN         NaN        NaN
4  509.931973  135.999178   0.001631

As expected, label 1 returns NaNs (since it was not found in the index) and label 4 returns the actual row, since it is in X_train. To fix the problem, either switch to .iloc or rebuild the index of X_train.
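For completeness, a minimal sketch of the second option, rebuilding the index (assuming X_train is the DataFrame from the question's train_test_split; y_train is already a numpy array there, so it is positional and needs no change):

# After train_test_split the row labels are shuffled, e.g. 4, 2, 7, ...
# reset_index(drop=True) renumbers the rows 0..n-1, so label-based
# and position-based indexing agree again and the KFold indices
# are valid labels.
X_train = X_train.reset_index(drop=True)

# If y_train were a pandas Series rather than a numpy array, it would
# need the same treatment:
# y_train = y_train.reset_index(drop=True)

Note that in modern pandas this bug fails loudly rather than silently: .ix was removed in pandas 1.0, and .loc with a list containing any missing label now raises a KeyError instead of returning NaN rows.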
