# How to generate train and test sets for 5-fold cross validation

I want to generate unique train and test sets to run 5-fold cross-validation. In each fold, 80% of the data should be selected as a train set and the remaining 20% as a test set. Each fold should have different data in the test set. How can I do this?


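You can use scikit-learn's StratifiedKFold class (from sklearn.model_selection) to split the data for 5-fold cross-validation. With n_splits=5, each fold holds out about 20% of the samples as the test set and trains on the remaining 80%. The five test sets do not overlap, and stratification keeps the class proportions of y roughly equal in every fold.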

Here is an example:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# sample data: 10 rows, 5 samples per class
X = np.array([[1, 2, 3], [2, 4, 6], [3, 6, 9], [4, 8, 12], [5, 10, 15],
              [6, 12, 18], [7, 14, 21], [8, 16, 24], [9, 18, 27], [10, 20, 30]])
y = np.array([0, 1, 1, 1, 1, 0, 0, 0, 1, 0])

# shuffle=True with a fixed random_state makes the split reproducible
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1001)

# split() yields one (train indices, test indices) pair per fold
for k, (train_idx, test_idx) in enumerate(kf.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # print('X_train:', X_train)
    # print('y_train:', y_train)
    print('fold: {0}, X_test: \n{1}'.format(k, X_test))
    print('fold: {0}, y_test: {1}'.format(k, y_test))

The code above prints the following output. You can see that each fold has a different test set.

fold: 0, X_test:
[[ 5 10 15]
 [ 6 12 18]]
fold: 0, y_test: [1 0]
fold: 1, X_test:
[[ 2  4  6]
 [10 20 30]]
fold: 1, y_test: [1 0]
fold: 2, X_test:
[[ 3  6  9]
 [ 7 14 21]]
fold: 2, y_test: [1 0]
fold: 3, X_test:
[[ 1  2  3]
 [ 9 18 27]]
fold: 3, y_test: [0 1]
fold: 4, X_test:
[[ 4  8 12]
 [ 8 16 24]]
fold: 4, y_test: [1 0]