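I have a large dataset that I want to use for multi-class classification with an LSTM. The dataset was generated by users in a MOOC course. Based on their log data, I want to predict their final grade (1, 2, ... 5). The log data covers more than 25,000 users and 161 event types. The users produced sequences of different lengths, and I don't want to use padding, so I wrote a generator that feeds the data to fit_generator.
The script works, but it runs slowly. Do you have any idea why?
The data looks like this:
user - the unique user name, event - the event created by the user, time - the duration between the previous event and the current event of that user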
User -+- event -+- time
1454 play 0 (it is the first event so it has to be 0)
1454 pause 10 (duration (second) between play and pause)
1454 play 1 (duration (second) between pause and play)
1454 stop 1 (duration (second) between play and stop)
1000 play 0 (it is the first event so it has to be 0)
1000 pause 455 (duration (second) between play and pause)
1000 stop 1 (duration (second) between pause and stop)
........
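For anyone who wants to reproduce the layout, the sample rows above could be loaded into a pandas DataFrame roughly like this. It is only a minimal sketch: the column names `user`, `event`, and `time` follow the description above, the variable name `x` mirrors the later code, and the values are just the example rows, not the real dataset.

```python
import pandas as pd

# Toy copy of the sample rows shown above (not the real dataset).
x = pd.DataFrame(
    {
        "user": [1454, 1454, 1454, 1454, 1000, 1000, 1000],
        "event": ["play", "pause", "play", "stop", "play", "pause", "stop"],
        "time": [0, 10, 1, 1, 0, 455, 1],
    }
)

# Each user contributes a sequence of a different length.
seq_lengths = x.groupby("user", sort=False).size()
print(seq_lengths.to_dict())
```

The differing sequence lengths per user are what makes padding (or a batch size of 1) necessary later on.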
First, I fitted a OneHotEncoder on the "event" column and built the target values (y):
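import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
# Unseen events encode as all zeros (scikit-learn >= 1.2 renames `sparse` to `sparse_output`).
ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")
ohe.fit(x.loc[:, ['event']])
# Targets: a 'user' column plus one indicator column per class label 0..5.
y = pd.get_dummies(df[1], prefix=None, columns=['pass'])
y.columns = ['user', 0, 1, 2, 3, 4, 5]
# Split by user so that each user's whole sequence ends up in exactly one set.
y_train, y_test = train_test_split(y, test_size=0.25, random_state=42)
x_train = x.loc[x['user'].isin(y_train.user.unique())]
x_test = x.loc[x['user'].isin(y_test.user.unique())]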
Then I created a generator:
import numpy as np

def generator_ohe(df_x, df_y, ohe):
    while True:
        # Yield one user's full sequence per step (batch size 1, no padding).
        for i in df_x.user.unique():
            df_o = ohe.transform(df_x.loc[df_x['user'] == i, ['event']])
            # Append the time column as one extra feature.
            df_o = np.hstack((df_o, df_x.loc[df_x['user'] == i, ['time']].values))
            yield (np.reshape(df_o, (1, df_o.shape[0], df_o.shape[1])),
                   df_y.loc[df_y['user'] == i, [0, 1, 2, 3, 4, 5]])

training_generator = generator_ohe(x_train, y_train, ohe)
validation_generator = generator_ohe(x_test, y_test, ohe)
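To make the generator's output concrete, here is a small sketch of the shape of one yielded input array, built from the toy rows in the data sample. It is an illustration under assumptions, not the code above: `np.eye` stands in for the fitted OneHotEncoder, the vocabulary is only the three events in the sample, and `toy_batches` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy copy of the sample rows from the data description (not the real dataset).
x = pd.DataFrame(
    {
        "user": [1454, 1454, 1454, 1454, 1000, 1000, 1000],
        "event": ["play", "pause", "play", "stop", "play", "pause", "stop"],
        "time": [0, 10, 1, 1, 0, 455, 1],
    }
)

# Assumed toy vocabulary; the real data has 161 event types.
events = sorted(x["event"].unique())  # ['pause', 'play', 'stop']
event_index = {name: k for k, name in enumerate(events)}

def toy_batches(df_x):
    """Yield (user, array) pairs; each array has shape (1, timesteps, n_events + 1)."""
    for user, rows in df_x.groupby("user", sort=False):
        codes = rows["event"].map(event_index).to_numpy()
        one_hot = np.eye(len(events))[codes]              # (timesteps, n_events)
        time_col = rows[["time"]].to_numpy(dtype=float)   # (timesteps, 1)
        yield user, np.hstack((one_hot, time_col))[np.newaxis, ...]

shapes = {user: batch.shape for user, batch in toy_batches(x)}
print(shapes)
```

The last dimension is the number of event columns plus one for `time`, so the input size given to the model has to equal that sum.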
Model:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

n_outputs = 6
features = 161  # must match the last dimension of the arrays the generator yields
model = Sequential()
model.add(LSTM(100, input_shape=(None, features)))  # None = variable sequence length
model.add(Dropout(0.2))
model.add(Dense(100, activation='relu'))
model.add(Dense(n_outputs, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit_generator(training_generator, steps_per_epoch=19000, epochs=15, verbose=1,
                              validation_data=validation_generator, validation_steps=5000)
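Because the generator yields one user per step, steps_per_epoch and validation_steps here count users rather than batches of several samples. As a rough check, this is the arithmetic for the 75/25 split, assuming exactly 25,000 users purely for illustration (the post only says "more than 25,000"). `split_sizes` is a hypothetical helper that mirrors how train_test_split rounds a fractional test_size up.

```python
import math

def split_sizes(n_users, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(n_users * test_size)
    return n_users - n_test, n_test

# Assumed user count, for illustration only.
n_train, n_test = split_sizes(25_000)
print(n_train, n_test)
```

With these assumed numbers, one pass over the data would be 18,750 training steps and 6,250 validation steps, compared with the 19,000 and 5,000 used above. In the actual script the counts could be read from the data instead, for example with `y_train['user'].nunique()`.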