Sentiment analysis RNN not learning

Asked: 2019-08-27 18:54:13

Tags: python tensorflow recurrent-neural-network cudnn tf.keras

I am trying to build an RNN for sentiment analysis. My training data is the "Sentiment140" dataset.

The problem is that, although the code runs, it runs very slowly (though that may just be my machine), and the accuracy seems stuck at around 50%, which is more or less random guessing.

I am preprocessing the data with a "word2vec"-style scheme in which each word is represented as a one-hot array. These word arrays are then placed into another array which, as a whole, represents one tweet. Each tweet's array is also padded so that they are all the same length.
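For context, `lexicon`, `MaxWordsLength` and `LexiconLength` are globals defined earlier in my script, roughly like this (the values and vocabulary below are placeholders, not my real setup):

import numpy as np

MaxWordsLength = 40   # padding target: maximum number of words per tweet

# real lexicon is built from the unique lemmatised words in the training set
lexicon = np.array(sorted({"good", "bad", "happy", "sad"}))
LexiconLength = len(lexicon)   # length of each one-hot word vector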

import csv
import numpy as np
import nltk
from nltk.stem import WordNetLemmatizer


class Tweet:

    def __init__(self, line):
        self.line = line

    def Process(self):

        # line is a CSV row from the Sentiment140 file: the first field is the
        # polarity label, the last field is the tweet text
        polarity = int(self.line[0])

        # map Sentiment140 labels {0, 2, 4} to class indices {0, 1, 2}
        if polarity == 2:
            polarity = 1
        elif polarity == 4:
            polarity = 2

        words = self.line[-1].lower()

        # removes punctuation etc. (clean() is defined elsewhere in my script)
        words = clean(words)

        TokenisedText = nltk.word_tokenize(words)

        # lemmatised, tokenised text for the entire line
        TokenisedText = [WordNetLemmatizer().lemmatize(word) for word in TokenisedText]

        WordVectors = np.zeros(shape=(MaxWordsLength, LexiconLength))

        # truncate so tweets longer than MaxWordsLength don't overflow the array
        for i in range(min(len(TokenisedText), MaxWordsLength)):
            word = TokenisedText[i]

            WordVector = np.zeros(LexiconLength)

            # one-hot: set the lexicon index of this word to 1; a word not in
            # the lexicon leaves the vector all zeros
            index = np.where(lexicon == word)
            WordVector[index] = 1

            WordVectors[i] = WordVector

        return WordVectors, polarity

# tweet structure: [ [1,0,0,...], [0,1,0,...], ... (incl. padding rows) ]
# polarity: 0, 1 or 2
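For example, processing one row looks like this (the row contents are made up; real Sentiment140 rows carry id, date, query and user fields between the label and the text):

row = ["4", "<id>", "<date>", "NO_QUERY", "<user>", "I love this!"]
vectors, polarity = Tweet(row).Process()
print(vectors.shape)   # (MaxWordsLength, LexiconLength)
print(polarity)        # 2, i.e. positive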

# preprocesses the data and yields it in batches
def GrabTrainingData():

    with open(ShuffledTrainingPath, "r", buffering=200000, encoding="latin-1") as InputFile:
        i = 0
        while True:
            # BatchSize: how many tweets are passed at one time?
            # MaxWordsLength: how many word arrays in each tweet? ADDS PADDING TO ENSURE UNIFORM TENSOR SIZES

            # DOES THIS MEAN CAN'T ANALYSE ANYTHING LONGER THAN MaxWordsLength?!?!

            # LexiconLength: how long is each word array?
            BatchTrain_x = np.zeros(shape=(BatchSize, MaxWordsLength, LexiconLength))
            BatchTrain_y = np.zeros(BatchSize)

            reader = csv.reader(InputFile)
            for line in reader:
                WordVectors, polarity = Tweet(line).Process()

                BatchTrain_x[i] = WordVectors
                BatchTrain_y[i] = polarity

                i += 1

                if i >= BatchSize:
                    i = 0
                    yield BatchTrain_x, BatchTrain_y
                    # reallocate so the batch just yielded isn't overwritten in place
                    BatchTrain_x = np.zeros(shape=(BatchSize, MaxWordsLength, LexiconLength))
                    BatchTrain_y = np.zeros(BatchSize)

            # end of file: rewind so the generator loops over the file again
            InputFile.seek(0)
            i = 0
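A quick sanity check of the generator output (shapes only):

batch_x, batch_y = next(GrabTrainingData())
print(batch_x.shape)   # (BatchSize, MaxWordsLength, LexiconLength)
print(batch_y.shape)   # (BatchSize,)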


import math
import os
import sys
from tensorflow import keras


# neural network model
def RNN():
    model = keras.models.Sequential()

    NumberOfClassifications = 3

    # CuDNNLSTM requires a GPU; return_sequences=True feeds the full sequence
    # into the second LSTM layer
    model.add(keras.layers.CuDNNLSTM(128, input_shape=(MaxWordsLength, LexiconLength), return_sequences=True))
    model.add(keras.layers.Dropout(0.2))

    model.add(keras.layers.CuDNNLSTM(128))
    model.add(keras.layers.Dropout(0.2))

    model.add(keras.layers.Dense(32, activation="relu"))
    model.add(keras.layers.Dropout(0.2))

    model.add(keras.layers.Dense(NumberOfClassifications, activation="softmax"))

    MyOptimiser = keras.optimizers.Adam(lr=1e-3, decay=1e-5)

    # sparse_categorical_crossentropy because the labels are integer class indices
    model.compile(loss="sparse_categorical_crossentropy", optimizer=MyOptimiser, metrics=["accuracy"])

    # ProcessTestingData() is defined elsewhere in my script
    x_test, y_test = ProcessTestingData(ShuffledTestingPath)
    model.fit_generator(generator=GrabTrainingData(), epochs=3,
                        steps_per_epoch=math.ceil(TrainingNumberOfLines / BatchSize),
                        validation_data=(x_test, y_test))

    model.save(os.path.join(sys.path[0], "MyRNN.h5"))
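After training, the saved model can be reloaded for inference, e.g. (a sketch; CuDNNLSTM layers still need a GPU to load):

model = keras.models.load_model(os.path.join(sys.path[0], "MyRNN.h5"))

vectors, _ = Tweet(row).Process()                  # row as in the earlier example
probs = model.predict(vectors[np.newaxis, ...])    # add a batch dimension
print(probs.argmax(axis=-1))                       # predicted class: 0, 1 or 2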

P.S. On a side note, could someone explain how to properly use use_multiprocessing with a data generator? I gather that I have to use keras.utils.Sequence(), but I am not sure how.
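My understanding is that the Sequence-based pattern looks roughly like this (a sketch only, with an illustrative TweetSequence name; I have not verified it against my pipeline):

class TweetSequence(keras.utils.Sequence):

    # index-based batches are what make use_multiprocessing=True safe
    def __init__(self, rows, batch_size):
        self.rows = rows                # pre-loaded list of CSV rows
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return math.ceil(len(self.rows) / self.batch_size)

    def __getitem__(self, idx):
        batch = self.rows[idx * self.batch_size:(idx + 1) * self.batch_size]
        x = np.zeros((len(batch), MaxWordsLength, LexiconLength))
        y = np.zeros(len(batch))
        for j, row in enumerate(batch):
            x[j], y[j] = Tweet(row).Process()
        return x, y

# model.fit_generator(TweetSequence(rows, BatchSize), use_multiprocessing=True, workers=4, ...)

Is that the right idea?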

Any help would be greatly appreciated.

0 Answers:

No answers yet.