Question

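I want to use historical data for two assets (BTC and ETH) to predict the next day's price.
The historical data includes OHLCV data, market cap, and dominance for both assets, so it is a collection of numerical features.
The prediction is a binary label (0 or 1) for the next day's price, where 0 means the price will fall and 1 means it will rise tomorrow.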

Here is a screenshot of the initial data:

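The values in the last column are shifted up by one row (a shift of -1), so today's data is used to predict whether the next day is green or red.
I scale the data with MinMaxScaler as follows: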

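import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()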
clean_df_scaled = min_max_scaler.fit_transform(all_data)
dataset = pd.DataFrame(clean_df_scaled)

#train test validation split
x_train, x_test, y_train, y_test = train_test_split(dataset.iloc[:, :15], dataset.iloc[:, 15], test_size=0.2)

x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2)

# np.int was removed in NumPy 1.24; the built-in int is equivalent here
y_train = np.array(y_train, dtype=int)
y_test = np.array(y_test, dtype=int)
y_val = np.array(y_val, dtype=int)

x_train = np.reshape(np.asarray(x_train), (x_train.shape[0], x_train.shape[1], 1))
x_test = np.reshape(np.asarray(x_test), (x_test.shape[0], x_test.shape[1], 1))
x_val = np.reshape(np.asarray(x_val), (x_val.shape[0], x_val.shape[1], 1))

Here is the model:

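# Assumes the Keras API bundled with TensorFlow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()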
model.add(LSTM(64, input_shape=(x_train.shape[1], x_train.shape[2]), return_sequences=True))
model.add(LSTM(32))
model.add(Dense(8, input_dim=16, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(x_train, y_train, epochs=100)

test_loss, test_acc = model.evaluate(x_val, y_val)
print('Test accuracy:', test_acc)

The output shows:

...
Epoch 98/100
24/24 [==============================] - 0s 12ms/step - loss: 0.6932 - accuracy: 0.4968
Epoch 99/100
24/24 [==============================] - 0s 12ms/step - loss: 0.6932 - accuracy: 0.4998
Epoch 100/100
24/24 [==============================] - 0s 13ms/step - loss: 0.6929 - accuracy: 0.5229
6/6 [==============================] - 1s 4ms/step - loss: 0.6931 - accuracy: 0.5027
Test accuracy: 0.5027027130126953

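I can't see what is wrong here! I also tried a softmax activation, with no luck (I know I should be using sigmoid for this).
I even tried removing the LSTM layers and using only Dense layers, still with no luck.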

P.S.

When I use the model to make predictions:

predictions = model.predict(x_test)

it does not return binary values. It returns floats like these:

...
[0.5089301 ],
       [0.5093736 ],
       [0.5081916 ],
       [0.50889516],
       [0.5077091 ],
       [0.5088633 ]], dtype=float32)

Is that normal? Should I convert them to binary (0 or 1) based on the mean?

Answer 1

野生动物园

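I believe Horace Lee has already pointed out in the comments that the problem lies in train_test_split. There is also a problem with how the data is arranged. In the sample data, and in the way train_test_split is used, each row is one sample and each column holds one feature. But the time series you are trying to model is encoded along the columns. When the data is fed to the model, the temporal dependence is gone, because each sample only contains information from a single point in time. The LSTM layers therefore cannot find any relationship, because the sequence dependence is not encoded within a row.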

You can split the data in the same proportions as you did, but without shuffling.

split = int(len(dataset) * 0.8)  # chronological 80/20 cut, no shuffling
x_train, y_train = dataset.iloc[:split, :15], dataset.iloc[:split, 15]
x_test, y_test = dataset.iloc[split:-1, :15], dataset.iloc[split:-1, 15]

Also pass shuffle=False to model.fit to prevent any shuffling of the data. This preserves the sequential dependence in the data.
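The question also carves a validation set out of the training data, so the same idea can be extended to three blocks. The following is a minimal sketch, not the asker's actual pipeline: chronological_split is a hypothetical helper, the fractions 0.64/0.16/0.20 simply mirror the two successive 80/20 splits in the question, and the toy DataFrame stands in for the real data.

```python
import numpy as np
import pandas as pd

def chronological_split(df, train_frac=0.64, val_frac=0.16):
    """Split a time-ordered DataFrame into train/val/test blocks without shuffling."""
    n = len(df)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]

# Toy stand-in for the real data: 100 time-ordered rows, 15 features plus a 0/1 target.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((100, 15)))
toy[15] = rng.integers(0, 2, size=100)

train_df, val_df, test_df = chronological_split(toy)
print(len(train_df), len(val_df), len(test_df))

# With Keras, the blocks would then be passed along these lines:
# model.fit(x_train, y_train, validation_data=(x_val, y_val), shuffle=False)
```

Every training row precedes every validation row, which in turn precedes every test row, so the model is always evaluated on days it has not seen.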

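Also, since each column in the dataset is a time series, you can model each series independently with a windowing approach. Just slide a segment of length window over the data one step at a time.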

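# window: number of past days per sample; feature_col: placeholder for the position of any feature column
window_dataset = [dataset.iloc[k:k + window, feature_col] for k in range(int(len(dataset) * 0.8))]

target = [dataset.iloc[k + window, 15] for k in range(int(len(dataset) * 0.8))]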
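To feed an LSTM, the windows need to end up as a 3-D array of shape (samples, window, features). Here is a small sketch of that step using plain NumPy. make_windows is a hypothetical helper, the toy arrays are made up, and the label index k + window follows the snippet above. If the target column has already been shifted by one day, it is worth checking whether k + window - 1 is the alignment you actually want.

```python
import numpy as np

def make_windows(features, target, window):
    """Build (samples, window, n_features) inputs and one label per window.

    features: 2-D array, one row per day, in chronological order.
    target:   1-D array aligned with the rows of features.
    """
    x, y = [], []
    for k in range(len(features) - window):
        x.append(features[k:k + window])  # `window` consecutive days
        y.append(target[k + window])      # label indexed as in the snippet above
    return np.asarray(x), np.asarray(y)

features = np.arange(20, dtype=float).reshape(10, 2)  # 10 days, 2 toy features
target = np.array([0, 1, 0, 1, 1, 0, 1, 0, 0, 1])
x, y = make_windows(features, target, window=3)
print(x.shape, y.shape)  # (7, 3, 2) (7,)
```

With this layout, input_shape=(window, n_features) tells the LSTM to read each sample as a sequence of days rather than a sequence of unrelated features.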

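But before trying an LSTM architecture, try a sequential model with only dense layers or a single LSTM layer, and check the data for class imbalance with data['target_header'].value_counts(). Taking one contiguous slice of the data can yield more samples of one particular class than the other.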
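As a rough illustration of the imbalance check, the sketch below counts the classes and derives the accuracy of always predicting the majority class, which is a useful floor to compare any model against. The labels here are made-up toy values, not the asker's data.

```python
import pandas as pd

# Toy labels standing in for the real target column.
labels = pd.Series([1, 0, 1, 1, 0, 1, 1, 0, 1, 1], name="target")

counts = labels.value_counts()                # samples per class
shares = labels.value_counts(normalize=True)  # fraction of samples per class
baseline_acc = float(shares.max())            # accuracy of always predicting the majority class

print(counts.to_dict())
print(round(baseline_acc, 2))
```

Running the same check separately on the training and test slices shows whether a contiguous split has left one slice dominated by a single class.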

Answer 2

This answer addresses your puzzlement over why .predict returns scores like these.

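When you call model.predict(x_test), it returns a matrix in which each row is the predicted probability that the corresponding input belongs to class 1. In other words, you get the probability of class 1 for every instance in x_test.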

...
[0.5089301 ],
       [0.5093736 ],
       [0.5081916 ],
       [0.50889516],
       [0.5077091 ],
       [0.5088633 ]], dtype=float32)
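As a side check, these scores all sit very close to 0.5, which is consistent with the loss of about 0.6931 in the training log: binary cross-entropy for a prediction of exactly 0.5 equals ln 2, whatever the true label. The short sketch below verifies that arithmetic. binary_cross_entropy is a hypothetical helper written for illustration, not a Keras function.

```python
import math

def binary_cross_entropy(y_true, p):
    """Binary cross-entropy for one sample with label y_true and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

loss_at_half = binary_cross_entropy(1, 0.5)
print(round(loss_at_half, 4))  # 0.6931
```

A loss stuck at this value suggests the network is outputting roughly 0.5 for every input instead of separating the classes.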

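To get binary outputs, we usually set a threshold (say 0.5): scores above it are treated as class 1 and scores below it as class 0. So you can do the following to get binary outputs (1 and 0):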

(model.predict(x_test) > 0.5).astype("int32")

Here, 0.5 is the threshold we chose. See this answer for more details.
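To make the thresholding concrete without needing a trained model, here is a small sketch that applies the same expression to the scores quoted in the question. The array is copied by hand from that output and is only a stand-in for model.predict(x_test).

```python
import numpy as np

# Scores copied from the output shown above; shape (n_samples, 1), like model.predict.
probs = np.array([[0.5089301], [0.5093736], [0.5081916],
                  [0.50889516], [0.5077091], [0.5088633]], dtype=np.float32)

labels = (probs > 0.5).astype("int32").ravel()  # 1 where the score exceeds the threshold
print(labels)  # every score is just above 0.5, so every label is 1
```

Thresholding only converts scores to labels. When all scores sit just above 0.5, as here, every sample is assigned class 1, so the conversion does not by itself fix a model that has not learned to separate the classes.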

Keras accuracy stuck at 50% in a binary classification problem
