Python - How to predict and test on multiple files using sklearn

Time: 2017-07-26 12:44:18

Tags: python python-3.x pandas scikit-learn random-forest

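I want to train a model and ultimately predict the true value using a random forest model in Python on a three-column dataset (the full CSV is linked in the original post). The dataset is formatted as follows: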

t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10

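I want to use the last N data points of X (for example N = 5, 10, 100, 300, 1000, etc.) to predict the current value of Y (the true value) using sklearn's random forest model in Python. That is, taking [0,0,1,2,3] from the X column as the input for the first window, I want to predict the value of Y in the 5th row, with the model trained on the previous values.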
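The windowing described above can be sketched as follows. This is only an illustration: the helper name `make_lag_features` and the column names `lag_1`, `lag_2`, ... are hypothetical, and it assumes the window consists of the `window` rows strictly before the row being predicted (the code in the question instead includes the current row via `shift(0)`).

```python
import pandas as pd

def make_lag_features(series, window):
    """Return a DataFrame whose row t holds the `window` values preceding row t."""
    # lag_1 is the value one row back, lag_2 two rows back, and so on
    lags = {f"lag_{i}": series.shift(i) for i in range(1, window + 1)}
    # The first `window` rows have no complete history, so they are dropped
    return pd.DataFrame(lags).dropna()

# The first rows of the sample shown in the question
df = pd.DataFrame({
    "X": [0, 0, 1, 2, 3, 4, 5, 6, 7],
    "Y": [10] * 9,
})

features = make_lag_features(df["X"], window=5)
# Keep only the Y values of the rows that have a full window
target = df.loc[features.index, "Y"]
print(features)
```

With a window of 5, the first usable row is the one at index 5, whose features are the X values of rows 4 down to 0.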

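Suppose we have 5 datasets in the current directory (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv). For a single trace (dataset), for example a1.csv, I can do the prediction with a window of 5 as follows: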

import numpy as np
import pandas as pd
from math import sqrt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv('a1.csv')

# Add lagged copies of X; rows without a full history are dropped below
for i in range(1, 5):
    df['X_t' + str(i)] = df['X'].shift(i)
print(df)
df.dropna(inplace=True)

# Features: X shifted by 0..4 rows (NaNs replaced by 0); target: Y
X = pd.DataFrame({'X_%d' % i: df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values

# criterion='mse' was renamed 'squared_error' in newer scikit-learn; it is the default
reg = RandomForestRegressor()
reg.fit(X, y)
modelPred = reg.predict(X)
print(modelPred)
print("Number of predictions:", len(modelPred))
modelPred.tofile('predictedValues1.txt', sep="\n", format="%s")

meanSquaredError = mean_squared_error(y, modelPred)
print("Mean Square Error (MSE):", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("Root-Mean-Square Error (RMSE):", rootMeanSquaredError)

I solved this with a random forest, which yields the following df and predictions:

       time  X   Y  X_t1  X_t2  X_t3  X_t4
0  0.000543  0  10   NaN   NaN   NaN   NaN
1  0.000575  0  10   0.0   NaN   NaN   NaN
2  0.041324  1  10   0.0   0.0   NaN   NaN
3  0.041331  2  10   1.0   0.0   0.0   NaN
4  0.041336  3  10   2.0   1.0   0.0   0.0
5  0.041340  4  10   3.0   2.0   1.0   0.0
6  0.041345  5  10   4.0   3.0   2.0   1.0
7  0.041350  6  10   5.0   4.0   3.0   2.0
.........................................................

[2845 rows x 7 columns]
[ 10.  10.  10. ...,  20.  20.  20.]
RMSE: 0.5136564734333562

However, I now want to run the prediction over all the files (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv), using 60% of the datasets whose file names start with a for training and the remaining 40% whose file names start with a for testing, with sklearn in Python (that is, 3 traces for training and 2 files for testing). How can I do this?

PS: All files have the same structure, but they have different lengths because they were generated with different parameters.

2 Answers:

Answer 0: (score: 2)

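import glob, os
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

# Read every file matching a*.csv in the current directory into one DataFrame
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "a*.csv"))))
# Build your X (features) and Y (target) from df here
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.40)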

Answer 1: (score: 1)

To read in multiple files, you need a slight extension. Collect the data from each CSV, then call pd.concat to join them:

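import pandas as pd

# Read a1.csv ... a5.csv into a list of DataFrames
df_list = []
for i in range(1, 6):
    df_list.append(pd.read_csv('a%d.csv' % i))

# Stack them row-wise into a single DataFrame
df = pd.concat(df_list)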

This reads in all your CSVs, and you can proceed as usual. Get X and y:

X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values

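Use train_test_split to split your data (it lived in sklearn.cross_validation at the time of writing; since scikit-learn 0.20 it is only available from sklearn.model_selection):

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)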
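Note that train_test_split shuffles and splits individual rows, whereas the question asks for three whole files for training and two for testing. Also, shifting X after concatenation lets the lag windows run across file boundaries. A minimal sketch of a file-level split with lags built per file might look like the following. The helper names `lagged_xy`, `stack` and `make_trace` are hypothetical, the window here uses the 5 preceding rows (an assumption), and synthetic frames stand in for `pd.read_csv('a%d.csv' % i)` so that the example runs without the real data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

WINDOW = 5  # number of previous X values per sample (assumed)

def lagged_xy(frame, window=WINDOW):
    """Build (features, target) from one file so lags never cross files."""
    lags = pd.DataFrame(
        {f"X_t{i}": frame["X"].shift(i) for i in range(1, window + 1)}
    ).dropna()
    return lags.to_numpy(), frame.loc[lags.index, "Y"].to_numpy()

def stack(frames):
    """Concatenate the per-file features and targets."""
    parts = [lagged_xy(f) for f in frames]
    return np.vstack([p[0] for p in parts]), np.concatenate([p[1] for p in parts])

def make_trace(n, seed):
    """Synthetic stand-in for one CSV file; not real data."""
    rng = np.random.default_rng(seed)
    x = np.cumsum(rng.integers(0, 2, size=n))
    y = 10 + 10 * (x > n // 4)
    return pd.DataFrame({"t_stamp": np.arange(n) * 1e-3, "X": x, "Y": y})

# Five traces of different lengths, mimicking a1.csv ... a5.csv
frames = [make_trace(n, seed) for seed, n in enumerate([40, 55, 30, 45, 60])]

# 60% of the files (3) for training, 40% (2) for testing
train_frames, test_frames = frames[:3], frames[3:]
X_train, y_train = stack(train_frames)
X_test, y_test = stack(test_frames)

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print("RMSE on held-out files:", rmse)
```

With real data, `frames` would be the list of DataFrames read from the CSV files; which three files go into the training set is a choice left to the reader.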

You may also want to look at StratifiedKFold.
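As a rough illustration of what StratifiedKFold does, the sketch below splits rows into folds that preserve the proportion of each target value. It assumes Y takes only a few discrete values (the sample output shows 10 and 20), since stratification treats the target as class labels. The data here is synthetic and the fold count is arbitrary. Like train_test_split, it splits rows rather than whole files.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: 5 lag features per row, target with two levels
rng = np.random.default_rng(0)
X = rng.random((20, 5))
y = np.array([10] * 12 + [20] * 8)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many rows of each target value land in this test fold
    values, counts = np.unique(y[test_idx], return_counts=True)
    fold_counts.append(dict(zip(values.tolist(), counts.tolist())))
print(fold_counts)
```

Each printed dictionary shows the per-value row counts of one test fold, which makes it easy to check that both target levels appear in every fold.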