How to tokenize text for use as input to a Keras neural network

Asked: 2018-12-03 02:46:07

Tags: python tensorflow keras nlp

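I am trying to do sentiment analysis on a dataset of English text in which each phrase is labelled with an integer from 0 to 4.

I have been following the TensorFlow guide at https://www.tensorflow.org/tutorials/keras/basic_text_classification and adapting it to my multi-class classification problem.

A sample of the dataset: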

PhraseId,SentenceId,Phrase,Sentiment
21071,942,irony,1
63332,3205,Blue Crush ' swims away with the Sleeper Movie of the Summer award .,2
142018,7705,in the third row of the IMAX cinema,2
103601,5464,images of a violent battlefield action picture,2
12235,523,an engrossing story,3
77679,3994,should come with the warning `` For serious film buffs only !,2
58875,2969,enjoyed it,3
152071,8297,"A delicious , quirky movie with a terrific screenplay and fanciful direction by Michael Gondry .",4

At the moment my model performs very poorly: accuracy sits at about 0.5 and does not change from one epoch to the next.
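One quick sanity check for a constant accuracy like this is to compare it with what a model would score by always predicting the most frequent label. The sketch below does that for the eight sample rows shown above only (assuming the label on the fourth row is 2); the label distribution of the full dataset is not given here, so the real baseline may differ.

```python
from collections import Counter

# Sentiment labels of the eight sample rows shown above (illustrative only).
sample_labels = [1, 2, 2, 2, 3, 2, 3, 4]


def majority_baseline(labels):
    """Return (most frequent label, accuracy of always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


baseline_label, baseline_acc = majority_baseline(sample_labels)
print(baseline_label, baseline_acc)
```

If the full dataset has a similar share of one label, an accuracy that never moves from that share would suggest the network is predicting a single class for every input.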

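I know how to tune the model's hyperparameters and I have tried all the tricks I can think of, but nothing seems to help. I am convinced I have made a mistake in processing the data, since this is my first time doing deep learning on text.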

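My current preprocessing consists of:

  • Dropping the PhraseId and SentenceId columns from the dataset
  • Removing punctuation and converting everything to lowercase
  • Shuffling the order of the dataset
  • Splitting the data and the labels into separate dataframes
  • One-hot encoding the labels
  • Tokenizing the data with the Keras preprocessing Tokenizer
  • Padding the sequences to the same length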
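For reference, the cleaning, shuffling and one-hot steps in the list above can be sketched in plain Python as follows. This is a minimal illustration, not the code actually used: the helper names `clean` and `one_hot` are hypothetical, and the three rows are taken from the sample above.

```python
import random
import string

NUM_CLASSES = 5  # labels are the integers 0..4


def clean(text):
    """Lowercase the text, drop ASCII punctuation, collapse extra spaces."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())


def one_hot(label, num_classes=NUM_CLASSES):
    """Turn an integer label into a 0/1 vector with a single 1."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


rows = [
    ("an engrossing story", 3),
    ("enjoyed it", 3),
    ("Blue Crush ' swims away with the Sleeper Movie of the Summer award .", 2),
]

# Shuffle phrases and labels together so that each phrase keeps its own label.
random.Random(0).shuffle(rows)

texts = [clean(phrase) for phrase, _ in rows]
labels = [one_hot(label) for _, label in rows]
```

The point worth double-checking in a real pipeline is the shuffle: if the phrases and the labels are shuffled separately, or the labels are split off before the shuffle, the pairs no longer match and the model has nothing to learn.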

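I think the problem is in the tokenization stage, or perhaps I simply do not understand how the model takes tokenized words as an input vector and manages to learn from them.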
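On the second point, a rough picture may help. Assuming the model keeps the embedding-then-average-pooling front end of the linked tutorial, the integer IDs are never used as magnitudes: each ID only selects a row of a trainable table, and the selected rows are averaged into one fixed-size vector. The sketch below mimics that forward pass in plain Python; the sizes and the random table are hypothetical and no training is shown.

```python
import random

VOCAB_SIZE = 10  # hypothetical, tiny for illustration
EMBED_DIM = 4    # hypothetical

rng = random.Random(0)
# One row of floats per token ID; row 0 belongs to the padding ID.
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Look up one vector per token ID (a table lookup, no arithmetic on the ID)."""
    return [embedding_table[i] for i in token_ids]


def average_pool(vectors):
    """Average the per-token vectors into a single fixed-size vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


padded = [7, 8, 3, 0, 0, 0]  # a padded sequence of token IDs
features = average_pool(embed(padded))
print(len(features))
```

In a real network the rows of the table are weights updated by backpropagation, so words that behave alike for the task end up with similar vectors.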

My relevant tokenization code is:

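    # Imports assumed by the two methods below (the enclosing class is not shown).
    import pandas as pd
    from tensorflow import keras
    from tensorflow.keras.preprocessing.text import Tokenizer

    def tokenize_data(self, df, max_features=5000):
        # Build a word index from the phrases, keeping only the most frequent words.
        self.logger.log(f'Tokenizing with {max_features} features')
        tokenizer = Tokenizer(num_words=max_features, split=' ')
        tokenizer.fit_on_texts(df.values)
        # Convert each phrase into a list of integer word indices.
        train_set = tokenizer.texts_to_sequences(df.values)
        if self.logger.verbose_f:
            self.logger.verbose(train_set[:10])
        return train_set

    def pad_sequences(self, data, maxlen=5000):
        # maxlen is the padded sequence length in tokens, not the vocabulary size.
        result = keras.preprocessing.sequence.pad_sequences(data,
                                                            value=0,
                                                            padding='post',
                                                            maxlen=maxlen)
        if self.logger.verbose_f:
            # Dump the padded matrix to CSV for inspection.
            df = pd.DataFrame(result)
            df.to_csv("processed.csv")

        return result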
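To make it easier to see what these two steps produce, here is a simplified pure-Python stand-in. It is not the Keras implementation: the function names are hypothetical, ties between equally frequent words are broken alphabetically (an arbitrary choice), and only the behaviour relevant here is covered, namely frequency-ranked IDs starting at 1, dropping IDs that are not below `num_words`, and zero-padding at the end.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to an ID; the most frequent word gets 1, 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}


def texts_to_ids(texts, word_index, num_words=None):
    """Replace words by their IDs, silently dropping unknown or too-rare words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences


def pad_post(sequences, maxlen, value=0):
    """Keep the last maxlen IDs of each sequence and pad the end with value."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


texts = ["an engrossing story", "enjoyed it", "an engrossing movie"]
word_index = fit_word_index(texts)
sequences = texts_to_ids(texts, word_index)
padded = pad_post(sequences, maxlen=4)
print(padded)
```

The IDs are ranks, not measurements: ID 2 is not "twice" ID 1, it is just the second most frequent word.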

The output of the padded sequences looks like this:

7,821,3794,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

8,74,44,344,325,2904,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
and so on, for every instance.

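These values are fed into the model exactly as shown, as the training data.

Do I need to apply some kind of normalization to them before training?

Or am I barking up the wrong tree entirely?

Thanks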

1 Answer:

Answer 0 (score: 0)

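  • Most of your data preprocessing is fine and does not need to change.
  • You should normalize the data: there is a large gap between values such as 0 and 2045, and that can lead to exploding gradients.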
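A minimal sketch of the normalization suggested here, assuming it means scaling every token ID into the range [0, 1] by dividing by the largest possible ID. The function name and the two short rows are hypothetical. Note that this only makes sense if the IDs go straight into dense layers; if the first layer is an embedding layer, as in the tutorial linked in the question, that layer expects the raw integer IDs and scaled floats would not work as lookup keys.

```python
def scale_ids(padded, vocab_size):
    """Divide every token ID by the largest possible ID so values lie in [0, 1]."""
    max_id = vocab_size - 1
    return [[token_id / max_id for token_id in row] for row in padded]


# Two shortened rows in the style of the padded output shown in the question.
padded = [[7, 821, 3794, 0, 0], [8, 74, 44, 344, 325]]
scaled = scale_ids(padded, vocab_size=5000)
print(scaled[0])
```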