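I am working with TensorFlow and some high-level APIs such as tflearn.
What I am trying to do here is sentiment analysis on the IMDB data using an LSTM. There is example code at the following link: https://github.com/tflearn/tflearn/blob/master/examples/nlp/lstm.py
However, it uses preprocessed data, and I want to use my own raw IMDB data (downloaded from http://ai.stanford.edu/~amaas/data/sentiment/).
Below is the code I updated for sentiment analysis. All the intermediate steps look correct, but the accuracy is unstable (as shown below). When I print the predictions at the end, I see that the probabilities for the two classes are almost identical (e.g. [[0.4999946355819702,0.5000053644180298],[0.5000001192092896,0.49999988079071045],[0.49999362230300903,0.5000064373016357],[0.49999985098838806,0.5000001192092896]]).
I don't think the problem is overfitting, because when I predict on the training data again, the results are the same as above. I think I am missing something or doing something wrong.
Any help is appreciated, thanks.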
# -*- coding: utf-8 -*-
from __future__ import division, print_function, absolute_import
import tflearn
from tflearn.data_utils import to_categorical, pad_sequences
import string
import numpy as nm
import codecs
import re
import collections
import math
import tensorflow as tf
import random
import glob
allWords = []
allDocuments = []
allLabels = []
def readFile(fileName, allWords):
    file = codecs.open(fileName, encoding='utf-8')
    for line in file:
        line = line.lower().encode('utf-8')
        words = line.split()
        for word in words:
            word = word.translate(None, string.punctuation)
            if word != '':
                allWords.append(word)
    file.close()
def readFileToConvertWordsToIntegers(dictionary, fileName, allDocuments, allLabels, label):
    file = codecs.open(fileName, encoding='utf-8')
    document = []
    for line in file:
        line = line.lower().encode('utf-8')
        words = line.split()
        for word in words:
            word = word.translate(None, string.punctuation)
            if word in dictionary:
                index = dictionary[word]
            else:
                index = 0  # dictionary['UNK']
            document.append(index)
    allDocuments.append(document)
    allLabels.append(label)
    file.close()
vocabulary_size = 10000
def build_dataset(words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count = unk_count + 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return dictionary, reverse_dictionary
fileList = glob.glob("/Users/inanc/Desktop/aclImdb/train/neg/*.txt")
for file in fileList:
    readFile(file, allWords)
fileList = glob.glob("/Users/inanc/Desktop/aclImdb/test/train/*.txt")
for file in fileList:
    readFile(file, allWords)
print(len(allWords))
dictionary, reverse_dictionary = build_dataset(allWords)
del allWords # Hint to reduce memory.
print(len(dictionary))
fileList = glob.glob("/Users/inanc/Desktop/aclImdb/train/neg/*.txt")
for file in fileList:
    readFileToConvertWordsToIntegers(dictionary, file, allDocuments, allLabels, 0)
fileList = glob.glob("/Users/inanc/Desktop/aclImdb/train/pos/*.txt")
for file in fileList:
    readFileToConvertWordsToIntegers(dictionary, file, allDocuments, allLabels, 1)
print(len(allDocuments))
print(len(allLabels))
c = list(zip(allDocuments, allLabels))  # shuffle before partitioning
random.shuffle(c)
allDocuments, allLabels = zip(*c)
trainX = allDocuments[:22500]
testX = allDocuments[22500:]
trainY = allLabels[:22500]
testY = allLabels[22500:]
#counter=collections.Counter(trainY)
#print(counter)
trainY = to_categorical(trainY, nb_classes=2)
testY = to_categorical(testY, nb_classes=2)
trainX = pad_sequences(trainX, maxlen=100, value=0.)
testX = pad_sequences(testX, maxlen=100, value=0.)
# Converting labels to binary vectors
trainY = to_categorical(trainY, nb_classes=2)
testY = to_categorical(testY, nb_classes=2)
# Network building
net = tflearn.input_data([None, 100])
net = tflearn.embedding(net, input_dim=vocabulary_size, output_dim=128)
net = tflearn.lstm(net, 128, dropout=0.8)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy')
# Training
model = tflearn.DNN(net, tensorboard_verbose=0)
model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
          batch_size=32)
predictions = model.predict(trainX)
print(predictions)
Results:
--
Training Step: 704  | total loss: 1.38629
| Adam | epoch: 001 | loss: 1.38629 - acc: 0.4698 | val_loss: 1.38629 - val_acc: 0.4925 -- iter: 22500/22500
--
Training Step: 1408  | total loss: 1.38629
| Adam | epoch: 002 | loss: 1.38629 - acc: 0.8110 | val_loss: 1.38629 - val_acc: 0.9984 -- iter: 22500/22500
--
Training Step: 1620  | total loss: 1.38629
| Adam | epoch: 003 | loss: 1.38629 - acc: 0.8306 -- iter: 06784/22500
Training Step: 2112  | total loss: 1.38629
| Adam | epoch: 003 | loss: 1.38629 - acc: 0.6303 | val_loss: 1.38629 - val_acc: 0.7382 -- iter: 22500/22500
--
Training Step: 2816  | total loss: 1.38629
| Adam | epoch: 004 | loss: 1.38629 - acc: 0.5489 | val_loss: 1.38629 - val_acc: 0.2904 -- iter: 22500/22500
--
Training Step: 3520  | total loss: 1.38629
| Adam | epoch: 005 | loss: 1.38629 - acc: 0.4848 | val_loss: 1.38629 - val_acc: 0.7828 -- iter: 22500/22500
--
Training Step: 4224  | total loss: 1.38629
| Adam | epoch: 006 | loss: 1.38629 - acc: 0.5233 | val_loss: 1.38629 - val_acc: 0.9654 -- iter: 22500/22500
--
Training Step: 4928  | total loss: 1.38629
| Adam | epoch: 007 | loss: 1.38629 - acc: 0.4400 | val_loss: 1.38629 - val_acc: 0.6725 -- iter: 22500/22500
--
Training Step: 5632  | total loss: 1.38629
| Adam | epoch: 008 | loss: 1.38629 - acc: 0.4319 | val_loss: 1.38629 - val_acc: 0.5808 -- iter: 22500/22500
--
Training Step: 6336  | total loss: 1.38629
| Adam | epoch: 009 | loss: 1.38629 - acc: 0.4765 | val_loss: 1.38629 - val_acc: 0.4833 -- iter: 22500/22500
--
Training Step: 7040  | total loss: 1.38629
| Adam | epoch: 010 | loss: 1.38629 - acc: 0.5203 | val_loss: 1.38629 - val_acc: 0.2373 -- iter: 22500/22500
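Answer 0 (score: 1)
Oh, my bad. I had typed the lines
trainY = to_categorical(trainY, nb_classes=2)
testY = to_categorical(testY, nb_classes=2)
twice, so only one class was left afterwards. After removing the duplicated lines, the problem was solved.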
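To illustrate why a second one-hot pass can leave only one class and pin the loss at 1.38629, here is a minimal sketch in plain Python. It assumes the one-hot helper marks targets with an index assignment of the form `Y[i, y[i]] = 1`, so a row that is already one-hot (`[0, 1]` or `[1, 0]`) gets treated as two class indices. `naive_one_hot` and `cross_entropy` are hypothetical stand-ins written for this illustration, not tflearn's actual code, and the behaviour of `to_categorical` may differ between tflearn versions.

```python
import math


def naive_one_hot(labels, nb_classes):
    """Hypothetical stand-in for a one-hot helper (not tflearn's code).

    Each entry of `labels` is used as an index (or a list of indices)
    into a zero row, mimicking an assignment like Y[i, y[i]] = 1.
    """
    rows = []
    for label in labels:
        row = [0.0] * nb_classes
        # A scalar label is one index; an already-encoded row is several.
        indices = label if isinstance(label, (list, tuple)) else [label]
        for idx in indices:
            row[int(idx)] = 1.0
        rows.append(row)
    return rows


def cross_entropy(target, probs):
    """Categorical cross-entropy for a single example."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))


labels = [0, 1, 1, 0]

once = naive_one_hot(labels, nb_classes=2)
twice = naive_one_hot(once, nb_classes=2)  # the accidental second call

print(once)   # one-hot rows, two distinct classes
print(twice)  # every row is [1.0, 1.0]: the classes are indistinguishable

# With target [1, 1] the loss -(log p0 + log p1) is smallest at
# p0 = p1 = 0.5, where it equals 2 * ln(2).
loss_at_half = cross_entropy([1.0, 1.0], [0.5, 0.5])
print(round(loss_at_half, 5))  # 1.38629
```

Under that assumption the symptoms in the question follow directly: every target becomes `[1, 1]`, the best the softmax can do is predict roughly `[0.5, 0.5]` for every input, and the loss settles at 2 ln 2 ≈ 1.38629. A cheap guard would be to check that each label row sums to 1 before calling `model.fit`.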