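I am trying to do sentiment analysis on a dataset of labelled English text, where each phrase is labelled with a number from 0 to 4.
I have been following the TensorFlow guide here: https://www.tensorflow.org/tutorials/keras/basic_text_classification and adapting it to my multi-class classification problem.
A sample of the dataset is here: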
PhraseId,SentenceId,Phrase,Sentiment
21071,942,irony,1
63332,3205,Blue Crush ' swims away with the Sleeper Movie of the Summer award .,2
142018,7705,in the third row of the IMAX cinema,2
103601,5464,images of a violent battlefield action picture,2 .
12235,523,an engrossing story,3
77679,3994,should come with the warning `` For serious film buffs only !,2
58875,2969,enjoyed it,3
152071,8297,"A delicious , quirky movie with a terrific screenplay and fanciful direction by Michael Gondry .",4
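Currently my model performs very poorly: its accuracy stays at about 0.5 and does not change from epoch to epoch.
I know how to tune the model's hyperparameters and I have tried every trick I can think of, but nothing seems to help. I am convinced I have made a mistake in how I process the data, since this is my first time doing deep learning on text data.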
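One way to sanity-check the 0.5 figure is to compare it with the accuracy of always predicting the most frequent label. The sketch below does this for the eight sample rows above only; the distribution of the full dataset is not shown here, so the result is illustrative. `majority_baseline` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Sentiment labels copied from the eight sample rows above.
# Assumption: the label of the fourth row (printed as "2 .") is read as 2.
sample_labels = [1, 2, 2, 2, 3, 2, 3, 4]


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


label, accuracy = majority_baseline(sample_labels)
print(f"always predicting {label} gives accuracy {accuracy:.2f}")
```

If the full dataset is similarly dominated by one label, an accuracy that sits near that fraction and never moves would suggest the model is predicting a single class for every input.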
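My current preprocessing consists of:
I think the problem is in the tokenization stage, or perhaps I simply do not understand how the model takes tokenized words as input vectors and manages to learn from them.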
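For context, here is a minimal sketch of how integer token ids are usually turned into vectors, assuming the model starts with an embedding layer followed by average pooling as in the linked guide. The ids act as row indices into a lookup table and not as magnitudes. `VOCAB_SIZE`, `EMBED_DIM`, `embedding_table` and `embed_and_pool` are hypothetical names for illustration, and the table is random here where a real layer would learn it during training.

```python
import random

random.seed(0)

VOCAB_SIZE = 10  # hypothetical vocabulary size; id 0 is reserved for padding
EMBED_DIM = 4    # hypothetical embedding width

# One vector per token id (random here; training would adjust these values).
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed_and_pool(token_ids):
    """Look up the vector for each non-padding id and average them."""
    vectors = [embedding_table[i] for i in token_ids if i != 0]
    if not vectors:
        return [0.0] * EMBED_DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


padded = [7, 8, 3, 0, 0, 0]  # a three-token phrase padded with zeros
pooled = embed_and_pool(padded)
print(len(pooled))  # one fixed-size vector per phrase
```

This sketch skips the padding id when averaging. Whether a real layer ignores padding depends on how masking is configured, so treat that detail as an assumption.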
My relevant tokenization code is:
    def tokenize_data(self, df, max_features=5000):
        self.logger.log(f'Tokenizing with {max_features} features')
        tokenizer = Tokenizer(num_words=max_features, split=' ')
        tokenizer.fit_on_texts(df.values)
        train_set = tokenizer.texts_to_sequences(df.values)

        if self.logger.verbose_f:
            self.logger.verbose(train_set[:10])
        return train_set

    def pad_sequences(self, data, maxlen=5000):
        result = keras.preprocessing.sequence.pad_sequences(data,
                                                            value=0,
                                                            padding='post',
                                                            maxlen=maxlen)
        if self.logger.verbose_f:
            df = pd.DataFrame(result)
            df.to_csv("processed.csv")

        return result
The output of the padded sequences looks like this:
7,821,3794,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,74,44,344,325,2904,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
and so on for each instance.
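To make the shape of that output concrete, here is a simplified pure-Python stand-in for the tokenize-then-pad steps. It is not the Keras implementation: `build_word_index` and `to_padded` are hypothetical helpers, and the padding length is taken from the longest phrase instead of a fixed 5000. The three phrases come from the sample rows above.

```python
from collections import Counter

phrases = [
    "an engrossing story",
    "enjoyed it",
    "in the third row of the IMAX cinema",
]


def build_word_index(texts):
    """Map each word to an id starting at 1, most frequent first (0 = padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_padded(texts, word_index, maxlen):
    """Convert texts to id lists, cut to maxlen, and right-pad with zeros."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[:maxlen]
        rows.append(ids + [0] * (maxlen - len(ids)))
    return rows


word_index = build_word_index(phrases)
longest = max(len(p.split()) for p in phrases)  # sequence length taken from the data
padded = to_padded(phrases, word_index, maxlen=longest)
for row in padded:
    print(row)
```

Note that the vocabulary size (`max_features` in my code) and the sequence length (`maxlen`) are separate quantities. Padding to a length far beyond the longest phrase gives rows that are almost entirely zeros, as in the output above.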
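These values are fed into the model as they are, as the training data.
Do I need to apply some kind of normalization before training on this?
Or am I barking up the wrong tree entirely?
Thanks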
Answer 0 (score: 0)