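I want to split my dataset into training and test sets. However, instead of the usual percentage split, I want to use "subject01.dat" as the test data and all the other files as the training data. How can I do this?
In case it matters, the dataset is 3D time-series data, but after my preprocessing it becomes a 2D NumPy array.
I was thinking of using train_test_split from sklearn.model_selection, but which options can I set to make sure that "subject01.dat" is kept as the test set?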
import pandas as pd
import tensorflow as tf
import numpy as np
from sklearn.model_selection import train_test_split
dir = '/home/hanna/Documents/_DDA_Lab/Exercise6/PAMAP2_Dataset/Protocol/'
filelist = ['subject101.dat','subject102.dat','subject103.dat','subject104.dat','subject105.dat','subject106.dat','subject107.dat']
# Required columns
columns = [1,2,4,5,6,7,8,9,10,11,12,13,14,15,20,21,22,23,24,25,26,27,28,29,30,31,32,37,38,39,40,41,42,43,44,45,46,47,48,49]
# Required rows
ID_rows = [3,4,12,13]
data = []  # Collect one preprocessed dataframe per file
for file in filelist:
    input = dir + file
    df = pd.read_csv(input, header=None, sep=r'\s+')
    print('Done reading data file ', input)
    df = df[columns] # Keep only the required columns & drop the rest
    df = df[df[1].isin(ID_rows)] # Keep only the required rows & drop the rest
    df = df.fillna(0) # Replace NaNs with zeros
    df = (df - df.mean()) / df.std() # Normalize data (note: this also rescales the label column 1)
    data.append(df)
df = pd.concat(data) # Merge into one dataframe
print(df.shape)
# Convert dataframe into tensor
x_data = df.drop(1, axis=1).values
y_data = df[[1]].values
# Train / Test split
xTrain, xTest, yTrain, yTest = train_test_split(x_data, y_data, test_size=0.15, random_state=0)
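Answer 0 (score: 2)
I'm not sure I understood you correctly, but I think you can build one DataFrame from the training files and a separate DataFrame from the test file.
For example, suppose subject101.dat is your test set: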
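filelist_test = ['subject101.dat']
filelist_train = ['subject102.dat','subject103.dat','subject104.dat','subject105.dat','subject106.dat','subject107.dat']

def load_files(files):
    # Same per-file preprocessing as in the question (reuses dir, columns, ID_rows)
    data = []  # Fresh list per call so train and test rows never mix
    for file in files:
        df = pd.read_csv(dir + file, header=None, sep=r'\s+')
        df = df[columns]
        df = df[df[1].isin(ID_rows)].fillna(0)
        df = (df - df.mean()) / df.std()
        data.append(df)
    return pd.concat(data)

train_df = load_files(filelist_train)
test_df = load_files(filelist_test)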
After that, you can split off the label column from each DataFrame and proceed as before. Hope this helps.
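An alternative that keeps everything in one DataFrame is to tag each row with the file it came from and then select the test rows with a boolean mask. The following is only a minimal sketch of that idea: the helper name `split_by_subject`, the extra `subject` column, and the small random frames standing in for the real PAMAP2 files are all assumptions made for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd


def split_by_subject(frames, test_subject, label_col=1):
    """Hold out one subject's rows as the test set (hypothetical helper).

    frames: dict mapping a file name to its already preprocessed DataFrame.
    """
    # Tag every row with the file it came from, then merge into one frame.
    tagged = [df.assign(subject=name) for name, df in frames.items()]
    merged = pd.concat(tagged, ignore_index=True)

    # Rows belonging to the held-out subject form the test set.
    is_test = merged["subject"] == test_subject
    features = merged.drop(columns=[label_col, "subject"])
    labels = merged[[label_col]]

    x_train, x_test = features[~is_test].values, features[is_test].values
    y_train, y_test = labels[~is_test].values, labels[is_test].values
    return x_train, x_test, y_train, y_test


# Synthetic stand-ins for the real files (assumption: column 1 holds the label).
rng = np.random.default_rng(0)
frames = {}
for name, n_rows in [("subject101.dat", 4), ("subject102.dat", 6), ("subject103.dat", 5)]:
    df = pd.DataFrame(rng.normal(size=(n_rows, 3)), columns=[2, 4, 5])
    df.insert(0, 1, rng.choice([3, 4, 12, 13], size=n_rows))
    frames[name] = df

x_train, x_test, y_train, y_test = split_by_subject(frames, "subject101.dat")
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```

With the real data, `frames` would be filled inside the reading loop instead of with random numbers. If a per-row array of subject identifiers is available, scikit-learn's `LeaveOneGroupOut` in `sklearn.model_selection` can produce this kind of split for each subject in turn.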