So I have successfully split my dataset into train and test sets in a 70:30 ratio. I used this:
df_glass['split'] = np.random.randn(df_glass.shape[0], 1)
msk = np.random.rand(len(df_glass)) <= 0.7
train = df_glass[msk]
test = df_glass[~msk]
print(train)
print(test)
Now, how do I split train and test into X_train and y_train, as well as X_test and y_test, so that X holds the features of the dataset and y holds the response?
I need to do supervised learning and apply an ML model to X_train and y_train.
My dataset looks like this: Database_snippet
Answer 0 (score: 1)
Scikit-Learn has a convenient function for splitting pandas DataFrames.
This will do the split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[list_of_X_cols], df['y'], test_size=0.33, random_state=42)
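If you would rather keep the boolean-mask split from the question, the feature/response separation can be done on `train` and `test` directly. The following is only a sketch: the stand-in `df_glass`, its column names, and the response column name `Type` are assumptions, since the real columns are not shown in the question. If a helper column such as `split` was added to the frame, it would need to be dropped from the features as well.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_glass: two feature columns plus a 'Type' label.
rng = np.random.default_rng(0)
df_glass = pd.DataFrame({
    "RI": rng.normal(1.52, 0.003, size=100),
    "Na": rng.normal(13.4, 0.8, size=100),
    "Type": rng.integers(1, 4, size=100),
})

target_col = "Type"  # assumption: name of the response column

# Same style of 70:30 boolean-mask split as in the question.
msk = rng.random(len(df_glass)) <= 0.7
train = df_glass[msk]
test = df_glass[~msk]

# Features are every column except the target; the response is the target column.
X_train = train.drop(columns=[target_col])
y_train = train[target_col]
X_test = test.drop(columns=[target_col])
y_test = test[target_col]

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

Note that a random mask gives roughly, not exactly, 70:30; `train_test_split` gives a fixed proportion.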
Answer 1 (score: 0)
I guess you might find this useful for understanding:
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Import the dataset
dataset = pd.read_csv('Salary_Data.csv')
x = dataset.iloc[:, :-1].values  # features: every column except the last
y = dataset.iloc[:, 1].values    # response: the second column
# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=1/3, random_state=0)
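Once the split is done, the model is fitted on the training part and evaluated on the held-out part. Here is a minimal sketch of that step. Because `Salary_Data.csv` is not available here, it uses synthetic data with one feature and one response; the numbers and the variable name `regressor` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Salary_Data.csv (assumption): one feature, one response.
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=(30, 1))                        # e.g. years of experience
y = 25000 + 9000 * x[:, 0] + rng.normal(0, 3000, size=30)   # e.g. salary

# Same split call as above.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=1/3, random_state=0
)

# Fit on the training set only.
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Predict on the held-out test set and report R^2 there.
y_pred = regressor.predict(x_test)
r2 = regressor.score(x_test, y_test)
print(r2)
```

The test set is only used for `predict` and `score`, never for `fit`, which is the point of splitting in the first place.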