How do I split a dataset into (X_train, y_train), (X_test, y_test)?

Date: 2020-11-02 13:11:49

Tags: python tensorflow keras scikit-learn

For reproducibility, the training and validation datasets I am using are shared here.

validation_dataset.csv is the ground truth for training_dataset.csv.

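What I am doing below is feeding the dataset into a simple CNN layer that extracts useful features from the images and passes them as 1D information to an LSTM network for classification.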

from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv1D, MaxPooling1D, LSTM, TimeDistributed, Dropout
from keras import optimizers
from keras.callbacks import EarlyStopping
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from numpy import genfromtxt

df_train = genfromtxt('data/train/training_dataset.csv', delimiter=',') 
df_validation = genfromtxt('data/validation/validation_dataset.csv', delimiter=',') 

#train,test = train_test_split(df_train, test_size=0.20, random_state=0)


df_train = df_train[..., None] 
df_validation = df_validation[..., None]


batch_size=8
epochs=5
    
model = Sequential()

model.add(Conv1D(filters=5, kernel_size=3, activation='relu', padding='same'))
model.add(MaxPooling1D(pool_size=2))
#model.add(TimeDistributed(Flatten()))
model.add(LSTM(50, return_sequences=True, recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(LSTM(10))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

adam = optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)

model.compile(optimizer="rmsprop", loss='mse', metrics=['accuracy'])
callbacks = [EarlyStopping('val_loss', patience=3)]


model.fit(df_train, df_validation, batch_size=batch_size)

print(model.summary())

   
scores = model.evaluate(df_train, df_validation, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

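I want to split the training and validation datasets into (X_train, y_train), (X_test, y_test) so that I can use both datasets for training and testing. I tried scikit-learn's split function, train, test = train_test_split(df_train, test_size=0.20, random_state=0), but after calling model.fit() it gave me the following error.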

ValueError: Data cardinality is ambiguous:
  x sizes: 14384
  y sizes: 3596
Please provide data which shares the same first dimension.

How can I split the dataset into (X_train, y_train), (X_test, y_test) that share the same first dimension?

2 answers:

Answer 0: (score: 1)

One way is to set up X and y separately. Here I am assuming the column that holds y is named 'target'.

target = df_train['target']
df_train = df_train.drop(columns=['target'])

X_train, X_test, y_train, y_test = train_test_split(df_train, target, test_size=0.20, random_state=0)
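
A minimal, self-contained sketch of this approach, assuming the data sits in a pandas DataFrame. The DataFrame here is synthetic, and the column names (f0 to f3, 'target') are hypothetical stand-ins, not taken from the asker's files. Note that the question loads its data with genfromtxt, which returns a NumPy array, so indexing by a column name like this would only apply after loading with pandas instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 4)), columns=["f0", "f1", "f2", "f3"])
df["target"] = rng.integers(0, 2, size=100)

# Separate the label column from the feature columns.
target = df["target"]
features = df.drop(columns=["target"])

# Passing features and labels together keeps their rows aligned.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.20, random_state=0
)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

Because both objects are split with the same shuffled row indices, each row of X_train still lines up with its label in y_train.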

-

It seems I initially misread your question: "validation_dataset.csv" is your label data. My apologies for not reading it properly.

In that case you don't need a 'target' variable, because that is exactly what df_validation is. So I think the following may work:

X_train, X_test, y_train, y_test = train_test_split(df_train, df_validation, test_size=0.20, random_state=0)
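
To illustrate why this should fix the cardinality error, here is a rough sketch with synthetic arrays. The row count 17980 is inferred from the error message (14384 + 3596); the feature count of 20 and the random values are assumptions for illustration only. Splitting X alone returns two pieces of X, whose lengths match the x and y sizes in the error if those two pieces were then passed to model.fit() (the question does not show that call, so this is a guess).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 17980 rows is inferred from the error (14384 + 3596); 20 features is hypothetical.
n_samples, n_features = 17980, 20
rng = np.random.default_rng(0)
X = rng.random((n_samples, n_features))[..., None]  # trailing channel axis, as in the question
y = rng.integers(0, 2, size=(n_samples, 1))         # stand-in for the label file

# Splitting only X yields two chunks of X with different lengths.
train, test = train_test_split(X, test_size=0.20, random_state=0)
print(train.shape[0], test.shape[0])

# Splitting X and y together yields matching first dimensions per subset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

This only works if df_train and df_validation have the same number of rows to begin with, since train_test_split requires all of its inputs to share the same length.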

Answer 1: (score: 0)

You are passing df_train to model.fit() as X and df_validation as y. You should take a look at the documentation here.

The code should look like this:

model.fit(X_train, y_train, validation_data=(X_val, y_val))
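
For completeness, a small end-to-end sketch of how that call might fit together with the split. Everything here is an assumption for illustration: the data is random, the shapes (200 samples, 16 steps) are made up, and the model is a cut-down version of the one in the question, imported via tensorflow.keras. The 'mse' loss and 'rmsprop' optimizer simply mirror the question's compile() call.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense

# Random stand-in data with hypothetical shapes: (samples, steps, channels).
rng = np.random.default_rng(0)
X = rng.random((200, 16, 1)).astype("float32")
y = rng.integers(0, 2, size=(200, 1)).astype("float32")

# Split features and labels together so each subset stays aligned.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Reduced version of the question's model, kept small so it runs quickly.
model = Sequential([
    Conv1D(filters=5, kernel_size=3, activation="relu", padding="same"),
    MaxPooling1D(pool_size=2),
    LSTM(10),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="mse", metrics=["accuracy"])

# x and y share a first dimension; the held-out pair goes in validation_data.
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=8, epochs=1, verbose=0,
)
print(sorted(history.history.keys()))
```

With validation_data supplied, Keras reports val_loss after each epoch, which is also the quantity the EarlyStopping callback in the question is set to monitor.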