I have been working on this classifier, and it almost seems to work. The only problem I have is with the test set.
import os
import re
from itertools import chain

import nltk
import pandas as pd

train = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'labeledTrainData.tsv'), header=0,
                    delimiter="\t", quoting=3)
test = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'testData.tsv'), header=0, delimiter="\t",
                   quoting=3)

documents = []
for review in train.values:
    sentiment = 'pos' if review[1] == 1 else 'neg'
    split = review[2].split(), sentiment
    for word in split[0]:
        word = re.sub(r'[^\w\s]', '', word)  # result is discarded, so tokens keep their punctuation
    documents.append(split)

word_features = nltk.FreqDist(chain(*[i for i, j in documents]))
word_features = list(word_features.keys())[:100]
train_set = [({i: (i in tokens) for i in word_features}, tag) for tokens, tag in documents[:1000]]

# train() is a classmethod that returns the trained classifier, so keep its return value
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test))  # the line in question: test is an unlabeled DataFrame
classifier.show_most_informative_features(5)
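The examples I have found take a single labeled set and split it for training and testing, for example at a 90/10 ratio. Here I actually have two different datasets (one labeled, one for testing).
train_set (a shortened version is shown below) is a list of tuples. Each tuple holds a dict of bool values stating whether each word in word_features occurs in the review, plus a tag saying whether the review is positive or negative: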
[({'beautician,': False, 'hubris,': False, '/>BTW:': False, 'nondenominational': False, 'diapered,': False, 'matter).': False, 'fascist\\"': False, 'Russian,gay': False, '/>\\"Ladies': False, 'purport': False, 'locker-room': False, 'Enjoy"': False, 'exposition': False, 'decisions\\"': False, 'N(n***as)': False, 'Duhllywood),': False, 'cataclysmic': False, 'reviews,': False, 'marry;': False, 'Gordon),': False, 'now-nostalgic': False, 'avoid!!!!"': False, 'coin;': False, 'infiltrators': False, 'smalltime': False, "`knows'": False, 'callous': False, 'actors...it': False, 'Fox,': False, "'78": False, 'Givney': False, 'cinematography):': False, 'misconstrued,': False, 'bathing;': False, 'Hepburn,': False, 'noise,': False, 'BG´s.': False, 'ship.In': False, "'60s.)<br": False, 'Odder': False, 'holes,disgustingly': False, '/>contact': False, 'Croasdell': False, 'trips\\"': False, 'acting.Yet': False, 'firearm.': False, 'businesspeople': False, 'Tomilinson': False, 'ways...<br': False, 'cast...ouch.': False, "Alexandra's": False, "lost.'": False, 'anwers,': False, 'dissertation': False, 'Perry': False, 'phenom': False, '\\"Cleopatra\\",': False, '"Revolt': False, 'secured': False, "romance',": False, 'retentively': False, '/>1/2': False, 'photography/\\"You': False, 'did--': False, 'consulate': False, 'ocurred.': False, 'profession': False, 'insane.': False, 'hysterics)': False, 'UPN.<br': False, 'effects--after': False, 'IMAGE,': False, 'recognizable.<br': False, "Kinky'with": False, 'death\x97it': False, 'Wizard\\"': False, 'pemberton,': False, 'Belting': False, 'boast.': False, 'Schlock!!': False, 'filmed)': False, 'overplotted': False, 'wiring,': False, 'comedy)': False, '`SS': False, 'foibles.': False, 'Germna': False, 'Waverly': False, 'Oxford-educated': False, 'reviews.Anyway': False, 'SANE': False, 'expressively': False, 'cr*p.': False, 'ex-priest': False, 'ITC': False, '/>Sara': False, 'exoticism-oriented': False, "'hello'": False, '"......in': False, 'hesitates': 
False}, 'neg')]
The test set, on the other hand, still looks like this:
id review
0 "12311_10" "Naturally in a film who's main themes are of ...
1 "8348_2" "This movie is a disaster within a disaster fi...
2 "5828_4" "All in all, this is a movie for kids. We saw ...
3 "7186_2" "Afraid of the Dark left me with the impressio...
4 "12128_7" "A very accurate depiction of small time mob l...
... ... ...
24997 "2531_1" "I was so disappointed in this movie. I am ver...
24998 "7772_8" "From the opening sequence, filled with black ...
24999 "11465_10" "This is a great horror film for people who do...
[25000 rows x 2 columns]
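The problem is that I can't simply run the classifier against this dataset. The raw training data looks just like the test set above, except that it also has a sentiment column containing 1 or 0. How would I go about training on the labeled data and then using the test set against it? I know there are some examples out there, but they are not quite the same as what I am doing.
Answer 0 (score: 0)
The test set must contain labels (the answers). nltk's evaluation methods expect them, and unless you already have labels there is really no way to measure performance. Do what the examples you have seen do: split the labeled set 90-10, train on 90%, and hold out the remaining 10% for testing.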
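A minimal sketch of that 90/10 hold-out split, using only the standard library for the data preparation. The helper names `extract_features` and `holdout_split`, the toy `documents` list, and the fixed seed are assumptions made for illustration and are not part of the original code.

```python
import random


def extract_features(tokens, word_features):
    """Map each feature word to whether it occurs in the review's tokens."""
    token_set = set(tokens)
    return {word: (word in token_set) for word in word_features}


def holdout_split(labeled_featuresets, test_fraction=0.1, seed=0):
    """Shuffle a copy of the labeled data and split it into (train, held-out) parts."""
    shuffled = list(labeled_featuresets)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for `documents` from the question: (tokens, tag) pairs.
documents = [("a great film".split(), "pos"), ("a dull film".split(), "neg")] * 10
word_features = ["great", "dull", "film"]

labeled = [(extract_features(tokens, word_features), tag) for tokens, tag in documents]
train_part, heldout_part = holdout_split(labeled)  # 18 training items, 2 held out
```

Assuming nltk is installed, `classifier = nltk.NaiveBayesClassifier.train(train_part)` followed by `nltk.classify.accuracy(classifier, heldout_part)` would then give an accuracy estimate on the held-out 10%. The unlabeled test file can still be run through the same feature extraction and passed to `classifier.classify_many(...)` to get predictions, but those predictions cannot be scored without labels.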