Converting data from a Pandas DataFrame into time-series training data for a Keras LSTM

Time: 2018-04-28 07:12:44

Tags: python pandas tensorflow keras lstm

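I am using Keras and Hyperas with an LSTM to predict price valuations. I am having trouble formatting the data from a Pandas DataFrame so that it can be used to train and test the LSTM model.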

This is how I currently read and split the data:

def data():
    maxlen = 100
    max_features = 20000
    #read the data
    df = DataFrame(pd.read_json('eth_usd_polo.json'))

    #normalize data
    scaler = MinMaxScaler(feature_range=(-1,1))
    df[['weightedAverage']] = scaler.fit_transform(df[['weightedAverage']])
    X = df[df.columns[-1:]]
    Y = df['weightedAverage']
    X_train, X_test, y_train, y_test = train_test_split(X, Y , test_size=0.33)


    return X_train, X_test, y_train, y_test, max_features, maxlen

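From the DataFrame, I am really only interested in the "weightedAverage" column, which holds the corresponding price, because I am doing univariate time-series forecasting.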

This is where I build the model:

def create_model(X_train, X_test, y_train, y_test, max_features, maxlen):
    #Build the model
    model = Sequential()
    model.add(LSTM(input_shape=(10, 1), return_sequences=True, units=20))
    model.add(Dropout(1))
    model.add(LSTM(20, return_sequences=False))
    #model.add(Flatten())
    model.add(Dropout(0.2))
    model.add(Dense(units=1))
    #model.add(Activation("linear"))

    #compile
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'],
                  optimizer={{choice(['rmsprop', 'adam', 'sgd'])}})

    #the monitor and earlystopping for the model training
    #monitor = EarlyStopping(monitor ='val_loss', patience=5,verbose=1, mode='auto')

    #fit everything together
    #model.fit(x_train ,y_train, validation_data=(x_test, y_test), callbacks =[monitor], verbose=2, epochs=1000)
    model.fit(X_train, y_train,
        batch_size={{choice([64, 128])}},
        epochs=1,
        verbose=2,
        validation_data=(X_test, y_test))

    score, acc = model.evaluate(X_test, y_test, verbose=0)

    print('Test accuracy:', acc)
    return {'loss': -acc, 'status': STATUS_OK, 'model': model}

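Something seems to be wrong with the way I extract and process the data from the Pandas DataFrame. The returned data (X_train, X_test, etc.) should have the following form: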

(25000, 10)
[[ data data data .... data data]
 [ data data data .... data data]
.
.
.
[ data data data .... data data]]

Instead, it is formatted like this:

   (7580, 1)
        weightedAverage
12420       255.151685
20094       871.386896
12099       300.802114

I thought the train_test_split function would split the data and shape it to the right size for me, but it does not seem to do what I want.
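To make the shape mismatch concrete, here is a minimal NumPy-only sketch of the difference between splitting rows and building sliding windows. The `series` array is a hypothetical stand-in for the scaled weightedAverage column, and the one-step-ahead `targets` are an assumption made for illustration; none of this is taken from the original code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical stand-in for the scaled 'weightedAverage' column.
series = np.arange(25, dtype=float)

window = 10  # matches input_shape=(10, 1) in the model above

# Splitting rows only selects rows, so each sample stays a single value.
column = series.reshape(-1, 1)
print(column.shape)  # (25, 1)

# Sliding windows: row i holds series[i : i + window].
# The last window is dropped so every window has a "next value" to predict.
windows = sliding_window_view(series, window)[:-1]
targets = series[window:]
print(windows.shape, targets.shape)  # (15, 10) (15,)

# Keras LSTM layers expect input shaped (samples, timesteps, features).
X = windows[..., np.newaxis]
print(X.shape)  # (15, 10, 1)
```

A row-wise split can then be applied to `X` and `targets`; it chooses which samples go where, but it never changes the shape of an individual sample.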

Any help is greatly appreciated!

1 Answer:

Answer 0: (score: 0)

After a lot of fiddling and trial and error, I got it working. The data for my LSTM is now nicely formatted, and it works well.

It can now also handle multivariate input, which I hope will improve the prediction quality.

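import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def data():
    maxlen = 10  # input window length (time steps per sample)
    steps = 10   # number of future values to predict per sample
    # read the data
    print('Loading data...')
    df = pd.read_json('eth_usd_polo.json')
    df = df.drop('date', axis=1)
    # normalize each column to [-1, 1]; the scalers are kept so the
    # scaling can be inverted later
    scalerList = []
    for head in df.columns:
        scaler = MinMaxScaler(feature_range=(-1, 1))
        df[[head]] = scaler.fit_transform(df[[head]])
        scalerList.append(scaler)
    Xtemp = np.array(df)
    # X: (samples, maxlen, features); Y: (samples, steps)
    X = np.zeros((len(Xtemp) - maxlen - steps, maxlen, Xtemp.shape[1]))
    Y = np.zeros((len(X), steps))
    for i in range(len(X)):
        # inputs: rows i .. i+maxlen-1, all columns
        X[i] = Xtemp[i:i + maxlen]
        # targets: the next `steps` values of column 6 (the target column)
        Y[i] = Xtemp[i + maxlen:i + maxlen + steps, 6]
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, shuffle=True)
    return X_train, X_test, y_train, y_test, maxlen, steps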
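
As a possible variation on the windowing above, the sketch below pulls the window construction into a helper and splits chronologically instead of shuffling. The names `make_windows` and `chronological_split` and the synthetic `values` array are hypothetical, introduced only for illustration. Unlike the code above, this version also keeps the final complete window (hence the `+ 1`). The unshuffled split is a swapped-in choice, not part of the original answer: with overlapping windows, a shuffled split puts near-identical samples in both sets, which may make validation scores look better than they are.

```python
import numpy as np


def make_windows(values, maxlen, steps, target_col):
    """Build (X, Y) from a 2-D array of shape (rows, features).

    X[i] holds rows i .. i+maxlen-1 (all features);
    Y[i] holds the next `steps` values of column `target_col`.
    """
    n_samples = len(values) - maxlen - steps + 1
    X = np.stack([values[i:i + maxlen] for i in range(n_samples)])
    Y = np.stack([values[i + maxlen:i + maxlen + steps, target_col]
                  for i in range(n_samples)])
    return X, Y


def chronological_split(X, Y, test_size=0.33):
    """Keep the last `test_size` fraction of windows for testing, unshuffled."""
    cut = int(round(len(X) * (1 - test_size)))
    return X[:cut], X[cut:], Y[:cut], Y[cut:]


# Hypothetical data: 50 rows, 7 features; column 6 counts 0, 1, 2, ...
values = np.zeros((50, 7))
values[:, 6] = np.arange(50)

X, Y = make_windows(values, maxlen=10, steps=10, target_col=6)
X_train, X_test, y_train, y_test = chronological_split(X, Y)
print(X.shape, Y.shape)  # (31, 10, 7) (31, 10)
print(X_train.shape, X_test.shape)
```

Because Y has shape (samples, steps), a model trained on it would presumably need a final Dense layer with `steps` units and a regression loss such as mean squared error, rather than the Dense(units=1) layer and categorical cross-entropy loss shown in the question.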