Question

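We are using the Kaggle dataset https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results/version/2, which covers 120 years of Olympic Games. Our goal is to train a model on data from previous Olympics and use it to predict the medals each country is likely to win at the next Games. We use the attributes Age, Sex, Height, Weight, NOC (country/region), Sport, and Event to predict the output class (gold, silver, bronze, no_medal). We want to use an LSTM so that the prediction is based on the previous few years of data rather than on the entire 120-year dataset.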

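The main challenge we face with the LSTM is how to shape its input. What should the time steps and the sample size be? How should the data be grouped before it is fed to the LSTM? For each country, every Olympic year and every sport contributes a variable number of rows.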

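We have been stuck on this step for several days.

It would be great if someone could give us some insight into how to feed the input to the LSTM.

We have written code like this: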

from keras.models import Sequential
from keras.layers import LSTM, Dense

def lstm_classifier(final_data):

    country_count = len(final_data['NOC'].unique())
    year_count = len(final_data['Year'].unique())

    values = final_data.values
    final_X = values[:, :-1]
    final_Y = values[:, -1]
    print(country_count, ' ', year_count)

    # reshape - # countries, time series, # attributes
    #final_X = final_X.reshape(country_count, year_count, final_X.shape[1])
    final_X = final_X.groupby("Country", as_index=True)['Year', 'Sex', 'Age', 'Height', 'Weight', 'NOC', 'Host_Country', 'Sport'].apply(lambda x: x.values.tolist())
    final_Y = final_Y.groupby("Country", as_index=True)['Medal'].apply(lambda x: x.values.tolist())

    # define model - 10 hidden nodes
    model = Sequential()
    model.add(LSTM(10, input_shape = (country_count, final_X.shape[1])))
    model.add(Dense(4, activation = 'sigmoid'))
    model.compile(optimizer = 'adam', loss = 'mean_squared_error', metrics = ['accuracy'])

    # fit network
    history = model.fit(final_X, final_Y, epochs = 10, batch_size = 50)

    loss, accuracy = model.evaluate(final_X, final_Y)
    print(accuracy)

Answer 1

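I am in the same situation: I want to make user-level predictions from raw log data. I don't actually know the right solution, but I have picked up a few tricks.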

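I think you are in a good position. First, you have to convert your 2D data to 3D, as Jason Brownlee does in his tutorial.

There is another good example that does the same thing.

They use this approach: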

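The Keras LSTM layer expects its input as a 3-dimensional NumPy array of shape (samples, time steps, features), where samples is the number of training sequences, time steps is the look-back window or sequence length, and features is the number of features of each sequence at each time step.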
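To make that shape concrete, here is a minimal NumPy sketch. It is not taken from either linked example: `make_windows` is a hypothetical helper and the toy array is made up. It slices a 2D table of (time steps, features) into overlapping windows and stacks them into a 3D array.

```python
import numpy as np

def make_windows(series_2d, seq_length):
    """Stack every full window of `seq_length` consecutive rows into a 3D array.

    Hypothetical helper for illustration; assumes series_2d has at least
    seq_length rows, ordered by time.
    """
    n_rows = series_2d.shape[0]
    windows = [series_2d[start:start + seq_length]
               for start in range(n_rows - seq_length + 1)]
    return np.stack(windows)

# Toy data: 6 time steps, 2 features per step (values are arbitrary).
toy = np.arange(12, dtype=np.float32).reshape(6, 2)
X = make_windows(toy, seq_length=3)
print(X.shape)  # (4, 3, 2): 4 samples, 3 time steps, 2 features
```

With 6 rows and a window of 3 there are 6 - 3 + 1 = 4 samples, so an LSTM fed this array would be declared with `input_shape=(3, 2)`, i.e. (time steps, features) without the sample dimension.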

# function to reshape features into (samples, time steps, features) 
def gen_sequence(id_df, seq_length, seq_cols):
    """ Only sequences that meet the window-length are considered, no padding is used. This means for testing
    we need to drop those which are below the window-length. An alternative would be to pad sequences so that
    we can use shorter ones """
    data_array = id_df[seq_cols].values
    num_elements = data_array.shape[0]
    for start, stop in zip(range(0, num_elements-seq_length), range(seq_length, num_elements)):
        yield data_array[start:stop, :]
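One way this generator might be applied to the Olympic data is sketched below. It rests on assumptions that are not in the question: the athlete-level rows have already been aggregated to one row per (NOC, Year), the columns `athletes` and `medals` are invented, and `gen_labels` is a hypothetical companion that takes the row right after each window as its target. `gen_sequence` is repeated so the block runs on its own.

```python
import numpy as np
import pandas as pd

def gen_sequence(id_df, seq_length, seq_cols):
    """Yield each window of seq_length consecutive rows that has a following row."""
    data_array = id_df[seq_cols].values
    num_elements = data_array.shape[0]
    for start, stop in zip(range(0, num_elements - seq_length),
                           range(seq_length, num_elements)):
        yield data_array[start:stop, :]

def gen_labels(id_df, seq_length, label_col):
    """Hypothetical companion: the label of the row right after each window."""
    return id_df[label_col].values[seq_length:]

# Toy table, assumed already aggregated to one row per (NOC, Year); values are made up.
toy = pd.DataFrame({
    "NOC": ["AAA"] * 5 + ["BBB"] * 4,
    "Year": [2000, 2004, 2008, 2012, 2016, 2004, 2008, 2012, 2016],
    "athletes": [10, 12, 11, 15, 14, 3, 4, 4, 6],
    "medals": [1, 2, 2, 3, 2, 0, 0, 1, 1],
})

seq_length = 2
seq_cols = ["athletes", "medals"]
X_parts, y_parts = [], []
# Build windows per country so that no window mixes rows from two countries.
for _, group in toy.sort_values(["NOC", "Year"]).groupby("NOC"):
    X_parts.extend(gen_sequence(group, seq_length, seq_cols))
    y_parts.extend(gen_labels(group, seq_length, "medals"))

X = np.array(X_parts, dtype=np.float32)
y = np.array(y_parts)
print(X.shape, y.shape)  # (5, 2, 2) (5,)
```

Country AAA has 5 rows and yields 3 windows, BBB has 4 rows and yields 2, so there are 5 samples of 2 time steps and 2 features each. Countries with too few Games to fill a window contribute nothing; padding shorter sequences to a fixed length, for example together with a Keras `Masking` layer, would be the alternative the docstring mentions.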

If you find a better solution, please don't hesitate to share it with us :-)
