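How do I build a DataFrame for time-series data with explicit timestamps?

Time: 2019-05-24 13:25:06

Tags: python pandas dataframe scikit-learn time-series

For my experiment I have a formatted CSV file that looks like an [N x M] matrix, where N = 40 is the total number of cycles (timestamps) and M = 1440 is the number of pixels. For each cycle I have 1440 pixel values, one per pixel, as shown below: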

timestamps[row_index] | feature1  | feature2 | ... | feature1439 | feature1440 |
-----------------------------------------------------------------
       1              |  1.00     |   0.32   |   0.30   |   0.30  |   0.30   | 
       2              |  0.35     |   0.33   |   0.30   |   0.30  |   0.30   | 
       3              |  1.00     |   0.33   |   0.30   |   0.30  |   0.30   | 
      ...             |   ....    |   ....   |   ....   |   ....  |   ....   | 
                      | -1.00     |   0.26   |   0.30   |   0.30  |   0.30   | 
                      |   0.67    |   0.03   |   0.30   |   0.30  |   0.30   | 
       30             |   0.75    |   0.42   |   0.30   |   0.30  |   0.30   |
________________________________________________________________________________ 
      31              |  -0.36    |   0.42   |   0.30   |   0.30  |   0.30   | 
      ...             |   ....    |   ....   |   ....   |   ....  |   ....   | 
      40              |   1.00    |   0.34   |   0.30   |   0.30  |  -1.00   |

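I want to split the dataset into a training set and a test set such that:

The training set contains the information for timestamps [1-30]

The test set contains the information for timestamps [31-40]

The problem is that after training the NN I can't get a proper continuous plot. This is most likely due to a poor data-splitting technique: I used train_test_split and have never tried TimeSeriesSplit. The split looks like this: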

trainX, testX, trainY, testY = train_test_split(trainX,trainY, test_size=0.2 , shuffle=False) 

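I already pass shuffle=False, so the last 20% of the data is used as test data and I can plot it in the right order. However, I still have no access to the cycle numbers of the test data, so when I plot it, it starts from 0 instead of continuing from the last cycle of the training data!

I would like to know whether it would be better to put the data into a pd.DataFrame and slice it by pd.Timestamp, as described in this post. Would that help, or is it unnecessary?

Update - full code: my column labels follow the pattern below, predicting only 960 of the 1440 columns: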

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from keras.layers import Dense , Activation , BatchNormalization
from keras.layers import Dropout
from keras.layers import LSTM,SimpleRNN
from keras.models import Sequential
from keras.optimizers import Adam, RMSprop

# raw string, so that "\t" in the Windows path is not read as a tab character
data_train = pd.read_csv(r"D:\train.csv", header=None)
#select interested columns to predict 980 out of 1440
j=0
index=[]
for i in range(1439):
    if j==2:
        j=0
        continue
    else:
        index.append(i)
        j+=1

Y_train= data_train[index]
data_train = data_train.values
print("data_train size: {}".format(Y_train.shape))

Creating the history:

def create_dataset(dataset,data_train,look_back=1):
    dataX,dataY = [],[]
    print("Len:",len(dataset)-look_back-1)

    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), :]
        dataX.append(a)
        dataY.append(data_train[i + look_back,  :])
    return np.array(dataX), np.array(dataY)

look_back = 10
trainX,trainY = create_dataset(data_train,Y_train, look_back=look_back)
#testX,testY = create_dataset(data_test,Y_test, look_back=look_back)
trainX, testX, trainY, testY = train_test_split(trainX,trainY, test_size=0.2)
print("train size: {}".format(trainX.shape))
print("train Label size: {}".format(trainY.shape))
print("test size: {}".format(testX.shape))
print("test Label size: {}".format(testY.shape))

Len: 29
train size: (23, 10, 1440)
train Label size: (23, 960)
test size: (6, 10, 1440)
test Label size: (6, 960)

The RNN, LSTM and GRU implementations are similar:

# create and fit the SimpleRNN model
model_RNN = Sequential()
model_RNN.add(SimpleRNN(units=1440, input_shape=(trainX.shape[1], trainX.shape[2])))
model_RNN.add(Dense(960))
model_RNN.add(BatchNormalization())
model_RNN.add(Activation('tanh'))
model_RNN.compile(loss='mean_squared_error', optimizer='adam')
callbacks = [
    EarlyStopping(patience=10, verbose=1),
    ReduceLROnPlateau(factor=0.1, patience=3, min_lr=0.00001, verbose=1)]
hist_RNN=model_RNN.fit(trainX, trainY, epochs =50, batch_size =20,validation_data=(testX,testY),verbose=1, callbacks=callbacks)

Finally, I expect to see the following output plot:

Y_RNN_Test_pred=model_RNN.predict(testX)
test_RNN= pd.DataFrame.from_records(Y_RNN_Test_pred)
test_RNN.to_csv('New/ttest_RNN_history.csv', sep=',', header=None, index=None)
test_MSE=mean_squared_error(testY, Y_RNN_Test_pred)

plt.plot(trainY[:,0],'b-',label='Train data')
plt.plot(testY[:,0],'c-',label='Test data')
plt.plot(Y_RNN_Test_pred[:,0],'r-',label='prediction')

1 Answer:

Answer 0: (score: 1)

There is just a small problem with the index.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

df = pd.read_csv('Train.csv', header=None)

# I'm not sure which column is the label, so I use df[0]
# and exclude that column from the features via df.loc[:,df.columns!=0]
trainX,testX,trainY,testY = train_test_split(df.loc[:,df.columns!=0],df[0], test_size=0.2, shuffle=False)

plt.plot(trainY)
plt.plot(testY)

Looks fine. :-)
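
The two curves line up because train_test_split keeps the original row labels of pandas objects when shuffle=False. A minimal sketch of that behaviour, assuming a hypothetical 40-row frame of random values in place of Train.csv (the names demo, X_tr, y_te and so on are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for Train.csv: 40 cycles, column 0 is the label
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((40, 5)))

X_tr, X_te, y_tr, y_te = train_test_split(
    demo.loc[:, demo.columns != 0], demo[0], test_size=0.2, shuffle=False)

# The test labels keep their original row numbers, and matplotlib
# uses a Series' index as the x values when the Series is plotted
print(y_tr.index.min(), y_tr.index.max())
print(y_te.index.min(), y_te.index.max())
```

Because the test Series still starts at row 32 here, plt.plot(y_te) would be drawn right after the training curve.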

So now we want to predict:

from sklearn.svm import SVR
reg = SVR(C=1, gamma='auto')
reg.fit(trainX, trainY) 
predY = reg.predict(testX)

plt.plot(trainY)
plt.plot(testY)
plt.plot(predY)

The index is wrong :-( Let's fix that, e.g. by using the index of testY:

plt.plot(trainY)
plt.plot(testY)
plt.plot(testY.index,predY)
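
The offset appears because predict returns a plain NumPy array, which carries no index, so matplotlib plots it against 0, 1, 2, ... A small sketch of the fix with made-up numbers (y_test, pred and pred_series are illustrative names, not part of the code above):

```python
import numpy as np
import pandas as pd

# Illustrative test labels whose index continues after 32 training rows
y_test = pd.Series(np.linspace(0.0, 1.0, 8), index=range(32, 40))

# Stand-in for reg.predict(testX): a bare array without any index
pred = np.full(len(y_test), 0.5)

# Re-attach the test index so both series share the same x positions
pred_series = pd.Series(pred, index=y_test.index)
print(pred_series.index[0], pred_series.index[-1])
```

Plotting pred_series then has the same effect as plt.plot(testY.index, predY).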

EDIT

A more general solution is to use a range over the length of the training set as its index. Do the same for testY and predY, only with a different start value (the length of trainY):

trainY.index = range(len(trainY))
testY.index = range(len(trainY), len(trainY)+len(testY))
#Maybe convert to DataFrame first
predY = pd.DataFrame(predY)
predY.index = range(len(trainY), len(trainY)+len(predY))

plt.plot(trainY)
plt.plot(testY)
plt.plot(predY)

Edit according to your new code

# trainY and testY are NumPy arrays in your code, so wrap them in DataFrames first
trainY = pd.DataFrame(trainY)
testY = pd.DataFrame(testY)
trainY.index = range(len(trainY))
testY.index = range(len(trainY), len(trainY)+len(testY))
test_RNN.index = range(len(trainY), len(trainY)+len(test_RNN))

# plot the first target column, as in your question
plt.plot(trainY[0],'b-',label='Train data')
plt.plot(testY[0],'c-',label='Test data')
plt.plot(test_RNN[0],'r-',label='prediction')

EDIT 2

Well, let's go through the code step by step:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from keras.layers import Dense , Activation , BatchNormalization
from keras.layers import Dropout
from keras.layers import LSTM,SimpleRNN
from keras.models import Sequential
from keras.optimizers import Adam, RMSprop

data_train = pd.read_csv("Train.csv", header=None)
#select interested columns to predict 980 out of 1440

Actually, you only select 960 columns to predict, see below.

#j=0
#index=[]
#for i in range(1439):
#    if j==2:
#        j=0
#        continue
#    else:
#        index.append(i)
#        j+=1
idx2 = [i for i in list(range(1440)) if i%3!=2]

If I understand your loop correctly, you want to take two out of every three values. So the list comprehension idx2 = [i for i in list(range(1440)) if i%3!=2] is a bit faster. You probably also want to include all columns, so use 1440 instead of 1439.

Y_train = data_train[idx2]
# data_train stays a DataFrame here (no .values), so columns can still be selected by label below
print("data_train size: {}".format(Y_train.shape))

In your code, Y_train has the shape (40,960). So you want to predict 960 variables, right? If so, the "clean" way is to remove these columns from data_train (and create an X_train):

index2 = [i for i in list(range(1440)) if i%3==2]
X_train = data_train[index2]

Now let's check the shapes:

print("X_train size: {}".format(X_train.shape))
print("Y_train size: {}".format(Y_train.shape))

>X_train size: (40, 480)
>Y_train size: (40, 960)

Looks right... ;-)

I made some modifications in the next part:
- You don't need to subtract 1 in the range (for i in range(len(dataset)-look_back):). Unlike in some other programming languages, the end value of a range is not included in Python. For example, list(range(0,3)) gives [0,1,2]. Maybe this is related to the 10 values you were missing (the last one)...
- I also take the values from Y_train.

def create_dataset(dataset,data_train,look_back=1):
    dataX,dataY = [],[]

    for i in range(len(dataset)-look_back):
        a = dataset[i:(i+look_back), :]
        dataX.append(a)
        dataY.append(data_train[i+look_back, :])
    return np.array(dataX), np.array(dataY)

look_back = 10
trainX,trainY = create_dataset(X_train.values, Y_train.values, look_back=look_back)
# shuffle=False keeps the chronological order, so the test windows follow the training windows
trainX, testX, trainY, testY = train_test_split(trainX,trainY, test_size=0.2, shuffle=False)

print("train size: {}".format(trainX.shape))
print("train Label size: {}".format(trainY.shape))
print("test size: {}".format(testX.shape))
print("test Label size: {}".format(testY.shape))

>train size: (24, 10, 480)
>train Label size: (24, 960)
>test size: (6, 10, 480)
>test Label size: (6, 960)

For the training I had to add two imports (from keras.callbacks import EarlyStopping, ReduceLROnPlateau), so:

from keras.callbacks import EarlyStopping, ReduceLROnPlateau
# create and fit the SimpleRNN model
model_RNN = Sequential()
model_RNN.add(SimpleRNN(units=1440, input_shape=(trainX.shape[1], trainX.shape[2])))
model_RNN.add(Dense(960))
model_RNN.add(BatchNormalization())
model_RNN.add(Activation('tanh'))
model_RNN.compile(loss='mean_squared_error', optimizer='adam')
callbacks = [
    EarlyStopping(patience=10, verbose=1),
    ReduceLROnPlateau(factor=0.1, patience=3, min_lr=0.00001, verbose=1)]
hist_RNN=model_RNN.fit(trainX, trainY, epochs =50, batch_size =20,validation_data=(testX,testY),verbose=1, callbacks=callbacks)

Make the predictions (unmodified):

Y_RNN_Test_pred=model_RNN.predict(testX)
test_RNN= pd.DataFrame.from_records(Y_RNN_Test_pred)
#test_RNN.to_csv('New/ttest_RNN_history.csv', sep=',', header=None, index=None)
test_MSE=mean_squared_error(testY, Y_RNN_Test_pred)

Then plot the data after adjusting the x-axis as described above (see "Edit according to your new code").
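
With windowed NumPy arrays there is no index to reuse, so the x positions can also be computed directly from look_back and the split sizes. The sketch below assumes 40 rows, look_back = 10 and random data. make_windows is a simplified, hypothetical stand-in for create_dataset, and the feature/target column counts are made up:

```python
import numpy as np

def make_windows(features, targets, look_back):
    """Pair each window of `look_back` rows with the target row that follows it."""
    X, y = [], []
    for i in range(len(features) - look_back):
        X.append(features[i:i + look_back])
        y.append(targets[i + look_back])
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)
data = rng.random((40, 6))          # hypothetical: 40 cycles, 6 features
look_back = 10
X, y = make_windows(data, data[:, :4], look_back)

# Chronological split: the first 80% of the windows train, the rest test
n_train = int(len(X) * 0.8)
trainX, testX = X[:n_train], X[n_train:]
trainY, testY = y[:n_train], y[n_train:]

# 0-based row number of every target, usable as x values when plotting
train_x = np.arange(look_back, look_back + len(trainY))
test_x = np.arange(look_back + len(trainY), look_back + len(y))
print(trainX.shape, testX.shape, train_x[[0, -1]], test_x[[0, -1]])
```

Calling plt.plot(train_x, trainY[:, 0]) and plt.plot(test_x, testY[:, 0]) would then place the test curve directly after the training curve instead of restarting at 0.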