Python NLTK: formatting the test set

Date: 2017-01-01 14:33:56

Tags: python nltk

I have been working on this classifier and it looks like it almost works. The only problem I have is with the test set.

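import os
import re
from itertools import chain

import nltk
import pandas as pd

# quoting=3 is csv.QUOTE_NONE, so quote characters in the reviews are kept as-is
train = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'labeledTrainData.tsv'), header=0,
                    delimiter="\t", quoting=3)

test = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'testData.tsv'), header=0,
                   delimiter="\t", quoting=3)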

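documents = []
for review in train.values:
    # review[1] is the 0/1 sentiment, review[2] is the review text
    sentiment = 'pos' if review[1] == 1 else 'neg'
    split = review[2].split(), sentiment
    for word in split[0]:
        # NOTE: this only rebinds the loop variable; the tokens stored in
        # split[0] keep their punctuation
        word = re.sub(r'[^\w\s]', '', word)
    documents.append(split)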

word_features = nltk.FreqDist(chain(*[i for i, j in documents]))
word_features = list(word_features.keys())[:100]

train_set = [({i: (i in tokens) for i in word_features}, tag) for tokens, tag in documents[:1000]]

# train() is a class method that returns the trained classifier, so keep its result
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test))
classifier.show_most_informative_features(5)

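In the examples I have found, a single labeled set is used and split into training and test parts, for example at a 90/10 ratio. Here I actually have two different datasets (one labeled, one for testing).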

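train_set (a shortened version is shown below) is a list of tuples. Each tuple holds a dict of bool values saying whether each word in word_features occurs in the review, followed by whether the review is positive or negative: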

 [({'beautician,': False, 'hubris,': False, '/>BTW:': False, 'nondenominational': False, 'diapered,': False, 'matter).': False, 'fascist\\"': False, 'Russian,gay': False, '/>\\"Ladies': False, 'purport': False, 'locker-room': False, 'Enjoy"': False, 'exposition': False, 'decisions\\"': False, 'N(n***as)': False, 'Duhllywood),': False, 'cataclysmic': False, 'reviews,': False, 'marry;': False, 'Gordon),': False, 'now-nostalgic': False, 'avoid!!!!"': False, 'coin;': False, 'infiltrators': False, 'smalltime': False, "`knows'": False, 'callous': False, 'actors...it': False, 'Fox,': False, "'78": False, 'Givney': False, 'cinematography):': False, 'misconstrued,': False, 'bathing;': False, 'Hepburn,': False, 'noise,': False, 'BG´s.': False, 'ship.In': False, "'60s.)<br": False, 'Odder': False, 'holes,disgustingly': False, '/>contact': False, 'Croasdell': False, 'trips\\"': False, 'acting.Yet': False, 'firearm.': False, 'businesspeople': False, 'Tomilinson': False, 'ways...<br': False, 'cast...ouch.': False, "Alexandra's": False, "lost.'": False, 'anwers,': False, 'dissertation': False, 'Perry': False, 'phenom': False, '\\"Cleopatra\\",': False, '"Revolt': False, 'secured': False, "romance',": False, 'retentively': False, '/>1/2': False, 'photography/\\"You': False, 'did--': False, 'consulate': False, 'ocurred.': False, 'profession': False, 'insane.': False, 'hysterics)': False, 'UPN.<br': False, 'effects--after': False, 'IMAGE,': False, 'recognizable.<br': False, "Kinky'with": False, 'death\x97it': False, 'Wizard\\"': False, 'pemberton,': False, 'Belting': False, 'boast.': False, 'Schlock!!': False, 'filmed)': False, 'overplotted': False, 'wiring,': False, 'comedy)': False, '`SS': False, 'foibles.': False, 'Germna': False, 'Waverly': False, 'Oxford-educated': False, 'reviews.Anyway': False, 'SANE': False, 'expressively': False, 'cr*p.': False, 'ex-priest': False, 'ITC': False, '/>Sara': False, 'exoticism-oriented': False, "'hello'": False, '"......in': False, 'hesitates': 
False}, 'neg')]
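
The keys in this sample still carry punctuation and HTML residue ('reviews,', '/>BTW:') because the result of re.sub in the loop is never stored, and list(word_features.keys())[:100] takes 100 keys in whatever order the dict yields them, not the 100 most frequent words. Below is a minimal sketch of one way to clean the tokens and choose features. The helper names clean_tokens, top_words and document_features are hypothetical, and collections.Counter stands in for nltk.FreqDist (a Counter subclass) so that the snippet runs without NLTK.

```python
import re
from collections import Counter


def clean_tokens(text):
    """Drop <br /> tags and punctuation, lowercase, then split on whitespace."""
    text = re.sub(r'<br\s*/?>', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.lower().split()


def top_words(token_lists, n=100):
    """Return the n most frequent words across all token lists."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    return [word for word, _ in counts.most_common(n)]


def document_features(tokens, word_features):
    """Map every feature word to True/False depending on whether it occurs in tokens."""
    present = set(tokens)
    return {word: (word in present) for word in word_features}


# Tiny made-up reviews, only to show the shapes involved.
reviews = ['A great film!<br />Great cast.', 'Dull, dull plot...']
token_lists = [clean_tokens(r) for r in reviews]
word_features = top_words(token_lists, n=3)
print(document_features(token_lists[0], word_features))
```

The same document_features call can be applied to labeled and unlabeled reviews alike, which keeps both sets in the format the classifier expects.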

The test set, however, still looks like this:

               id                                             review
0      "12311_10"  "Naturally in a film who's main themes are of ...
1        "8348_2"  "This movie is a disaster within a disaster fi...
2        "5828_4"  "All in all, this is a movie for kids. We saw ...
3        "7186_2"  "Afraid of the Dark left me with the impressio...
4       "12128_7"  "A very accurate depiction of small time mob l...
...           ...                                                ...
24997    "2531_1"  "I was so disappointed in this movie. I am ver...
24998    "7772_8"  "From the opening sequence, filled with black ...
24999  "11465_10"  "This is a great horror film for people who do...

[25000 rows x 2 columns]

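The problem I have now is that I can't simply train on this dataset as it is: the original training data looks just like the test set above, except that it also includes a sentiment column with the value 1 or 0. How would I go about training on it and then using the test set against it? I know there are some examples, but they are not quite the same as what I am doing.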

1 answer:

Answer 0 (score: 0)

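The test set must contain labels (the answers). NLTK's evaluation methods expect them, and there is really no way to measure performance unless you already have the labels. Do the 90-10 split of the labeled set that you saw in the examples: train on 90% and hold out 10% for testing.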
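
A minimal sketch of that split, assuming the labeled data is already a list of (feature_dict, label) pairs like train_set in the question. The function names split_labeled, train_and_evaluate and predict_unlabeled are hypothetical, as are the seed and test_fraction defaults. The NLTK calls used are NaiveBayesClassifier.train, nltk.classify.accuracy and classify_many.

```python
import random


def split_labeled(labeled_featuresets, test_fraction=0.1, seed=0):
    """Shuffle a copy of the labeled data and return (train_part, held_out_part)."""
    items = list(labeled_featuresets)
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_fraction))
    return items[n_test:], items[:n_test]


def train_and_evaluate(labeled_featuresets, test_fraction=0.1, seed=0):
    """Train Naive Bayes on the training part and score it on the held-out part."""
    import nltk  # imported here so split_labeled() also works without NLTK installed

    train_part, held_out = split_labeled(labeled_featuresets, test_fraction, seed)
    classifier = nltk.NaiveBayesClassifier.train(train_part)
    accuracy = nltk.classify.accuracy(classifier, held_out)
    return classifier, accuracy


def predict_unlabeled(classifier, featuresets):
    """Predict labels for feature dicts that have no known sentiment."""
    return classifier.classify_many(featuresets)
```

With the question's data this would be called as classifier, accuracy = train_and_evaluate(train_set). The unlabeled reviews from testData.tsv cannot be scored, but once they are converted to feature dicts with the same word_features they can be passed to predict_unlabeled to obtain predicted 'pos'/'neg' labels.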