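The dataset "review_data" contains Tripadvisor reviews and customer ratings. I am building a ULMFiT language model to predict text sequences in the "Review" column.
The dataframe looks like this: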
Review Rating
0 nice hotel expensive parking got good deal sta... 4
1 ok nothing special charge diamond member hilto... 2
2 nice rooms not 4* experience hotel monaco seat... 3
3 unique, great stay, wonderful time hotel monac... 5
4 great stay great stay, went seahawk game aweso... 5
review_data.shape
(20491, 2)
review_data = review_data[['Rating', 'Review']]
# Split into train and validation data, stratified by rating
from sklearn.model_selection import train_test_split
df_trn, df_val = train_test_split(review_data, stratify=review_data['Rating'], test_size=0.2, random_state=12)
print(df_trn.shape, df_val.shape)
(16392, 2) (4099, 2)
# Language model data (fastai v1 API)
from fastai.text import *
data_lm = TextLMDataBunch.from_df(train_df=df_trn, valid_df=df_val, path="")
# Build the language model
learn = language_model_learner(data_lm, arch=AWD_LSTM, drop_mult=0.3)
I am training the language model with the learning rate suggested by the learning-rate finder:
learn.lr_find()
learn.recorder.plot(suggestion = True)
min_grad_lr = learn.recorder.min_grad_lr
learn.fit_one_cycle(3,min_grad_lr)
epoch train_loss valid_loss accuracy time
0 5.987871 5.842046 0.163807 02:13
1 5.675927 5.674133 0.173539 02:13
2 5.304801 5.632321 0.176002 02:13
The accuracy after training is very low. After unfreezing and fine-tuning the model, the accuracy still did not improve:
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3)
epoch train_loss valid_loss accuracy time
0 5.180523 5.562634 0.181098 02:39
1 5.121043 5.504951 0.185985 02:39
2 4.919733 5.491002 0.187887 02:39
3 4.678843 5.540877 0.187085 02:39
4 4.506824 5.582721 0.184676 02:39
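To make the trend in the table above easier to read, here is a minimal sketch that only reuses the numbers already printed by the unfrozen run. It assumes valid_loss is a mean cross-entropy in nats, so that exp(valid_loss) can be read as perplexity; the names history, gaps and best_perplexity are just for illustration and are not part of fastai.

```python
import math

# (epoch, train_loss, valid_loss) copied from the unfrozen run above
history = [
    (0, 5.180523, 5.562634),
    (1, 5.121043, 5.504951),
    (2, 4.919733, 5.491002),
    (3, 4.678843, 5.540877),
    (4, 4.506824, 5.582721),
]

# Epoch with the lowest validation loss
best_epoch, _, best_valid = min(history, key=lambda row: row[2])

# Assumption: the loss is mean cross-entropy in nats, so exp(loss) is perplexity
best_perplexity = math.exp(best_valid)

# Gap between validation and training loss per epoch;
# a steadily widening gap is a common sign of overfitting
gaps = [(epoch, valid - train) for epoch, train, valid in history]

print(f"lowest valid_loss at epoch {best_epoch}: {best_valid:.4f}")
print(f"perplexity at that epoch: {best_perplexity:.1f}")
for epoch, gap in gaps:
    print(f"epoch {epoch}: valid_loss - train_loss = {gap:.3f}")
```

Read this way, train_loss keeps falling while valid_loss stops improving after epoch 2, which may indicate that the last two epochs are mostly fitting the training set.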
What could be the reason for this? Is there any way to improve the accuracy?