Question

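I have a large dataset on which I want to do multi-class classification with an LSTM. The data was generated by users in a MOOC course. From their log data I want to predict their final grade (1, 2, ..., 5). The logs cover more than 25,000 users and 161 event types. Users produced sequences of different lengths and I do not want to use padding, so I wrote a generator that feeds the data to fit_generator.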

The script runs, but it is slow. Do you have any idea why?

The data looks like this:

user: the unique user ID; event: the event created by the user; time: the duration between that user's previous event and the current event

User -+- event -+- time     
1454     play       0      (it is the first event so it has to be 0)    
1454     pause      10     (duration (second) between play and pause)   
1454     play       1      (duration (second) between pause and play)   
1454     stop       1      (duration (second) between play and stop)    
1000     play       0      (it is the first event so it has to be 0)    
1000     pause      455    (duration (second) between play and pause)   
1000     stop       1      (duration (second) between pause and stop)   
........
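
As a small illustration of how the `time` column above can be derived, here is a minimal sketch. It assumes each raw log row carries an absolute timestamp in seconds; the raw timestamps and the helper name `add_durations` are assumptions made for illustration and do not appear in the original data.

```python
def add_durations(rows):
    """Turn (user, event, timestamp) rows into (user, event, time) rows.

    `time` is the number of seconds since the same user's previous event,
    and 0 for a user's first event. Rows are assumed to be sorted by
    timestamp within each user.
    """
    last_seen = {}  # user -> timestamp of that user's previous event
    out = []
    for user, event, ts in rows:
        delta = ts - last_seen[user] if user in last_seen else 0
        out.append((user, event, delta))
        last_seen[user] = ts
    return out


# Hypothetical raw timestamps chosen to reproduce the table above.
raw = [
    (1454, "play", 100), (1454, "pause", 110),
    (1454, "play", 111), (1454, "stop", 112),
    (1000, "play", 0), (1000, "pause", 455), (1000, "stop", 456),
]
rows = add_durations(raw)
for row in rows:
    print(row)
```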

First, I fit a OneHotEncoder on the "event" column and set up the target values (y):

ohe = OneHotEncoder(sparse=False,handle_unknown="ignore")
ohe.fit(x.loc[:,['event']])

y = pd.get_dummies(df[1],prefix = None, columns=['pass'])
y.columns = ['user',0,1,2,3,4,5]    


y_train, y_test = train_test_split(y, test_size=0.25, random_state=42)

x_train = x.loc[x['user'].isin(y_train.user.unique())]
x_test = x.loc[x['user'].isin(y_test.user.unique())]

Then I created a generator:

def generator_ohe(df_x,df_y,ohe):
    while True:
        # Loop over every user and yield that user's full sequence as one batch of size 1
        for i in df_x.user.unique():
            df_o = ohe.transform(df_x.loc[df_x['user']==i,['event']])
            df_o = np.hstack((df_o,df_x.loc[df_x['user']==i,['time']].values)) 
            yield (np.reshape(df_o, (1,df_o.shape[0], df_o.shape[1])), df_y.loc[df_y['user'] == i,[0,1,2,3,4,5]])           


training_generator = generator_ohe(x_train,y_train,ohe)
validation_generator =generator_ohe(x_test,y_test,ohe)
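
To make the shape of what the generator yields concrete, here is a minimal NumPy-only sketch for a single user. It assumes a toy vocabulary of three events instead of the real 161, and the helper `encode_user` is hypothetical; it stands in for the `ohe.transform` plus `np.hstack` steps above rather than reproducing them.

```python
import numpy as np

# Assumed toy vocabulary; the real data has 161 event types.
EVENTS = ["pause", "play", "stop"]


def encode_user(events, times):
    """Build one (1, timesteps, features) array for a single user."""
    index = {name: i for i, name in enumerate(EVENTS)}
    one_hot = np.zeros((len(events), len(EVENTS)))
    for row, name in enumerate(events):
        if name in index:  # unknown events stay all-zero, like handle_unknown="ignore"
            one_hot[row, index[name]] = 1.0
    # Append the time column to the right of the one-hot columns.
    time_col = np.asarray(times, dtype=float).reshape(-1, 1)
    features = np.hstack((one_hot, time_col))
    # Add a leading batch axis of size 1.
    return features.reshape(1, features.shape[0], features.shape[1])


batch = encode_user(["play", "pause", "play", "stop"], [0, 10, 1, 1])
print(batch.shape)  # (1, 4, 4)
```

With this layout the last axis has one column per event type plus one for `time`, so the `features` value passed to `input_shape` has to match that total.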

Model:

n_outputs = 6
features =161

model = Sequential()
model.add(LSTM(100, input_shape=(None,(features))))
model.add(Dropout(0.2))
model.add(Dense(100, activation='relu'))
model.add(Dense(n_outputs, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit_generator(training_generator,steps_per_epoch=19000, epochs=15,verbose = 1,
                    validation_data = validation_generator, validation_steps= 5000)
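
Because every generator step yields exactly one user, `steps_per_epoch` and `validation_steps` count single-user batches. The sketch below works through that arithmetic, assuming exactly 25,000 users (the question only says "more than 25,000") and the 75/25 split used above; the variable names are illustrative.

```python
import math

n_users = 25_000   # assumption: the question says "more than 25,000"
test_size = 0.25

# With a float test_size, train_test_split rounds the test share up.
n_test = math.ceil(n_users * test_size)
n_train = n_users - n_test

steps_per_epoch = 19_000
validation_steps = 5_000
epochs = 15

# Total number of batches of size 1 processed over the whole run.
train_steps_total = steps_per_epoch * epochs
val_steps_total = validation_steps * epochs

print(n_train, n_test)
print(train_steps_total, val_steps_total)
```

Under these assumptions one epoch is 19,000 batches of size 1, slightly more than the 18,750 training users, and the full run is 285,000 training steps plus 75,000 validation steps.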

Feeding sequences to an LSTM model using a generator in Keras

0 answers: