I'm trying to build an RNN for sentiment analysis. My training data is the Sentiment140 dataset.
The problem is that, although the code runs, it runs very slowly (though that may just be my computer), and the accuracy seems to be stuck at around 50%, which is more or less random guessing.
I preprocess the data with a "word2vec"-style scheme in which each word is represented as a one-hot array. These word arrays are then placed inside another array, which then represents a complete tweet. Each tweet's array is also padded so that they are all the same length.
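As a toy illustration of that representation (the three-word lexicon and the tweet here are made up, purely to show the shape of the data):

    import numpy as np

    # Made-up three-word lexicon, just for illustration.
    lexicon = np.array(["good", "bad", "movie"])     # LexiconLength = 3
    tweet_tokens = ["good", "movie"]
    MaxWordsLength = 4                               # every tweet padded/truncated to 4 words

    WordVectors = np.zeros((MaxWordsLength, len(lexicon)))
    for i, word in enumerate(tweet_tokens):
        index = np.where(lexicon == word)[0]
        WordVectors[i, index] = 1

    print(WordVectors)
    # [[1. 0. 0.]    <- "good"
    #  [0. 0. 1.]    <- "movie"
    #  [0. 0. 0.]    <- padding
    #  [0. 0. 0.]]   <- padding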
import csv
import math
import os
import sys

import nltk
import numpy as np
from nltk.stem import WordNetLemmatizer
from tensorflow import keras

# clean(), lexicon and the various constants are defined elsewhere (sketched below)
class Tweet:
    def __init__(self, line):
        self.line = line

    def Process(self):
        # using file structure
        # polarity = int(self.line.split(",")[0][1:])
        polarity = int(self.line[0])
        # map Sentiment140 labels 0/2/4 to class indices 0/1/2
        if polarity == 0:
            polarity = 0
        elif polarity == 2:
            polarity = 1
        elif polarity == 4:
            polarity = 2
        words = self.line[-1].lower()
        # removes punctuation etc.
        words = clean(words)
        TokenisedText = nltk.word_tokenize(words)
        # lemmatised, tokenised text for the entire line
        TokenisedText = [WordNetLemmatizer().lemmatize(word) for word in TokenisedText]
        WordVectors = np.zeros(shape=(MaxWordsLength, LexiconLength))
        # truncate to MaxWordsLength so that long tweets don't overflow the array
        for i in range(min(len(TokenisedText), MaxWordsLength)):
            word = TokenisedText[i]
            WordVector = np.zeros(LexiconLength)
            index = np.where(lexicon == word)
            WordVector[index] = 1
            # same as += 1 here, as the max value for any one element is 1 - one-hot array
            # may be inefficient to append
            WordVectors[i] = WordVector
        return WordVectors, polarity

# tweet structure: [ [1,0,0,...], [0,1,0,...], ... (incl. padding) ]
# polarity: 0, 1 or 2
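For reference, the helpers and constants the class above relies on look roughly like this. This is a simplified sketch rather than my real code: the constant values are placeholders, the lexicon file name is hypothetical, and the real clean() is longer:

    import re
    import numpy as np

    # Placeholder values - the real ones depend on the dataset.
    BatchSize = 128
    MaxWordsLength = 40            # max words per tweet; anything longer is truncated
    LexiconLength = 10000          # number of distinct words in the lexicon
    TrainingNumberOfLines = 1600000
    ShuffledTrainingPath = "shuffled_training.csv"
    ShuffledTestingPath = "shuffled_testing.csv"

    # lexicon: numpy array of every lemmatised word kept from the dataset,
    # built in a separate preprocessing pass and saved to disk (filename hypothetical)
    lexicon = np.load("lexicon.npy", allow_pickle=True)

    def clean(text):
        # strips URLs, @mentions and punctuation, leaving only letters and whitespace
        text = re.sub(r"https?://\S+|@\w+", " ", text)
        return re.sub(r"[^a-z\s]", " ", text)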
# preprocesses the data and yields it in batches
def GrabTrainingData():
    with open(ShuffledTrainingPath, "r", buffering=200000, encoding="latin-1") as InputFile:
        i = 0
        while True:
            # BatchSize: how many tweets are passed at one time?
            # MaxWordsLength: how many word arrays in each tweet? ADDS PADDING TO ENSURE UNIFORM TENSOR SIZES
            # DOES THIS MEAN I CAN'T ANALYSE ANYTHING LONGER THAN MaxWordsLength?!?!
            # LexiconLength: how long is each word array?
            BatchTrain_x = np.zeros(shape=(BatchSize, MaxWordsLength, LexiconLength))
            BatchTrain_y = np.zeros(BatchSize)
            reader = csv.reader(InputFile)
            for line in reader:
                WordVectors, polarity = Tweet(line).Process()
                BatchTrain_x[i] = WordVectors
                BatchTrain_y[i] = polarity
                i += 1
                if i >= BatchSize:
                    i = 0
                    yield BatchTrain_x, BatchTrain_y
            # rewind so the next epoch loops over the file again
            i = 0
            InputFile.seek(0)
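As a quick sanity check, pulling a single batch out of the generator gives the shapes I expect (using only the names defined above):

    gen = GrabTrainingData()
    x_batch, y_batch = next(gen)
    print(x_batch.shape)   # (BatchSize, MaxWordsLength, LexiconLength)
    print(y_batch.shape)   # (BatchSize,)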
# neural network model
def RNN():
    model = keras.models.Sequential()
    NumberOfClassifications = 3
    model.add(keras.layers.CuDNNLSTM(128, input_shape=(MaxWordsLength, LexiconLength), return_sequences=True))
    model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.CuDNNLSTM(128))
    model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.Dense(32, activation="relu"))
    model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.Dense(NumberOfClassifications, activation="softmax"))
    MyOptimiser = keras.optimizers.Adam(lr=1e-3, decay=1e-5)
    model.compile(loss="sparse_categorical_crossentropy", optimizer=MyOptimiser, metrics=["accuracy"])
    # ProcessTestingData (defined elsewhere) preprocesses the test file the same way Tweet.Process does
    x_test, y_test = ProcessTestingData(ShuffledTestingPath)
    model.fit_generator(generator=GrabTrainingData(), epochs=3,
                        steps_per_epoch=math.ceil(TrainingNumberOfLines / BatchSize),
                        validation_data=(x_test, y_test))
    model.save(os.path.join(sys.path[0], "MyRNN.h5"))
P.S. On a side note, could someone explain how to use use_multiprocessing properly with a data generator? I gather I need to wrap the generator in a keras.utils.Sequence, but I'm not sure how.
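From reading the docs, my best guess is something like the sketch below - an untested attempt that loads the whole CSV into memory so that __getitem__ can index any batch (the class name TweetSequence and the workers value are mine):

    class TweetSequence(keras.utils.Sequence):
        # each index corresponds to one batch, so Keras can request batches
        # from multiple worker processes safely
        def __init__(self, path, batch_size):
            with open(path, "r", encoding="latin-1") as f:
                self.lines = list(csv.reader(f))
            self.batch_size = batch_size

        def __len__(self):
            # number of batches per epoch
            return math.ceil(len(self.lines) / self.batch_size)

        def __getitem__(self, idx):
            batch = self.lines[idx * self.batch_size:(idx + 1) * self.batch_size]
            x = np.zeros((len(batch), MaxWordsLength, LexiconLength))
            y = np.zeros(len(batch))
            for i, line in enumerate(batch):
                x[i], y[i] = Tweet(line).Process()
            return x, y

    # then, instead of the raw generator:
    # model.fit_generator(TweetSequence(ShuffledTrainingPath, BatchSize), epochs=3,
    #                     use_multiprocessing=True, workers=4,
    #                     validation_data=(x_test, y_test))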
Any help would be greatly appreciated.