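I'm a bit confused about using random_state and shuffle together. I want to split my data without shuffling it. It seems that when I set shuffle to False, the number I choose for random_state doesn't matter: I get the same output (the split is identical for random_state 42, 2, 7, 17, and so on). Why?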
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=False)
However, if shuffle is True, I get a different output (split) for each random_state.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
Answer 0 (score: 1)
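If shuffle is set to False, train_test_split simply reads the data in its original order: the first 75% of the samples become the training set and the last 25% become the test set. No random numbers are drawn, so the random_state parameter is ignored entirely.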
Example:
from sklearn.model_selection import train_test_split
X = list(range(50))  # list of the numbers 0 to 49
y = X  # just for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=False)
print(X_train)  # prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]
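When shuffle is set to True, random_state is used as the seed for the random number generator. As a result, your dataset is randomly divided into training and test sets, and the same seed always reproduces the same split.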
Example with random_state = 42:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True)
print(X_train)  # prints [8, 3, 6, 41, 46, 47, 15, 9, 16, 24, 34, 31, 0, 44, 27, 33, 5, 29, 11, 36, 1, 21, 2, 43, 35, 23, 40, 10, 22, 18, 49, 20, 7, 42, 14, 28, 38]
Example with random_state = 44:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=44, shuffle=True)
print(X_train)  # prints [13, 11, 2, 12, 34, 41, 30, 16, 39, 28, 24, 8, 18, 9, 4, 10, 0, 19, 21, 29, 14, 1, 48, 38, 7, 43, 25, 22, 23, 42, 46, 49, 32, 3, 45, 35, 20]
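To make the "why" more concrete, here is a rough pure-Python sketch of the logic. This is not scikit-learn's actual implementation: simple_split is a hypothetical helper I made up for illustration, it uses Python's random module rather than NumPy, and its shuffled order will not match what train_test_split returns. It only shows that the seed feeds the shuffling step and nothing else, so skipping that step makes the seed irrelevant.

```python
import math
import random


def simple_split(data, test_size=0.25, random_state=None, shuffle=True):
    """Hypothetical, simplified stand-in for train_test_split (illustration only)."""
    n_test = math.ceil(len(data) * test_size)  # 50 samples * 0.25 -> 13 test samples
    n_train = len(data) - n_test               # remaining 37 go to training
    indices = list(range(len(data)))
    if shuffle:
        # The seed is consumed here and nowhere else.
        random.Random(random_state).shuffle(indices)
    train = [data[i] for i in indices[:n_train]]
    test = [data[i] for i in indices[n_train:]]
    return train, test


X = list(range(50))

# shuffle=False: the seed is never used, so every seed gives the same split.
a = simple_split(X, random_state=42, shuffle=False)
b = simple_split(X, random_state=7, shuffle=False)
print(a == b)  # True

# shuffle=True: the same seed reproduces the same split.
c = simple_split(X, random_state=42, shuffle=True)
d = simple_split(X, random_state=42, shuffle=True)
print(c == d)  # True
```

So with shuffle=False you can simply leave random_state out; it only matters when you want a shuffled split that is reproducible.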