How do I split training and test datasets into X_train, y_train and X_test, y_test?

Time: 2017-11-16 04:36:16

Tags: python pandas machine-learning scikit-learn

I have successfully split my dataset into train and test sets in a 70:30 ratio. I used this:

df_glass['split'] = np.random.randn(df_glass.shape[0], 1)
msk = np.random.rand(len(df_glass)) <= 0.7
train = df_glass[msk]
test = df_glass[~msk]
print(train)
print(test)

Now, how do I split train and test into X_train, y_train and X_test, y_test, where X holds the features of the dataset and y holds the response?

I need to do supervised learning and fit ML models on X_train and y_train.

My dataset looks like this: Database_snippet

2 Answers:

Answer 0: (score: 1)

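Scikit-learn has a convenient function for splitting pandas DataFrames, train_test_split.

This does the split and separates features from the response in one call: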

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[list_of_X_cols], df['y'], test_size=0.33, random_state=42)
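
If the train and test DataFrames from the question are kept as they are, the features and the response can also be separated by column selection. The sketch below is only an illustration under assumptions: the DataFrame is a small synthetic stand-in for df_glass, and the column names (RI, Na, Mg, Type) and the choice of Type as the response are hypothetical, since the real columns are only visible in the question's image.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_glass: three feature columns and a "Type" response.
rng = np.random.default_rng(0)
df_glass = pd.DataFrame({
    "RI": rng.normal(1.52, 0.003, size=100),
    "Na": rng.normal(13.4, 0.8, size=100),
    "Mg": rng.normal(2.7, 1.4, size=100),
    "Type": rng.integers(1, 4, size=100),
})

# Same boolean-mask approach as in the question (roughly, not exactly, 70:30).
msk = rng.random(len(df_glass)) <= 0.7
train = df_glass[msk]
test = df_glass[~msk]

target = "Type"  # assumed name of the response column

# Features: every column except the response. Response: the target column alone.
X_train = train.drop(columns=[target])
y_train = train[target]
X_test = test.drop(columns=[target])
y_test = test[target]

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Because the mask is random, the split sizes vary from run to run unless the generator is seeded; train_test_split with a fixed random_state gives a reproducible split of the requested size.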

Answer 1: (score: 0)

You might find this example useful for understanding the idea:

import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LinearRegression

# Import the dataset
dataset = pd.read_csv('Salary_Data.csv')
# Features: every column except the last; response: the column at index 1
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=1/3, random_state=0)
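
The snippet above imports LinearRegression but stops after the split. A minimal sketch of the next step, fitting on the training part and evaluating on the held-out part, could look like the following. It uses synthetic data in place of Salary_Data.csv (which is not available here), so the numbers and the "years of experience / salary" interpretation are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Salary_Data.csv: one feature column, one response.
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=(30, 1))                        # assumed: years of experience
y = 25000 + 9000 * x[:, 0] + rng.normal(0, 3000, size=30)   # assumed: salary

# Split into training and test sets, as in the answer above.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=1/3, random_state=0)

# Fit on the training data only, then predict on the unseen test data.
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# R^2 on the test set gives a rough measure of how well the fit generalizes.
r2 = model.score(x_test, y_test)
print("Test R^2:", r2)
```

The model never sees x_test or y_test during fitting, which is the point of making the split before training.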