Question

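As far as I can tell from the examples of using NLTK classifiers:

They only seem to work with features of the sentences themselves. So you have...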

corpus = [
    "This is a sentence",
    "This is another sentence",
]

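...and you apply some function, such as count_words_ending_in_a_vowel(), to the sentence itself.

Instead, I want to attach a piece of external data to each sentence: not something derived from the text itself, but an external label, like this: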

corpus = [
    ("This is a sentence", "awesome"),
    ("This is another sentence", "not awesome"),
]

or

corpus = [
    {"text": "This is a sentence", "label": "awesome"},
    {"text": "This is another sentence", "label": "not awesome"},
]

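(In case I might have more than one external label.)

My question is: given that my dataset contains these external labels, how do I reformat the corpus into the format that NaiveBayesClassifier.train() expects? I know I also need to apply a tokenizer to the "text" field above, but what is the overall format I should feed into the NaiveBayesClassifier.train function?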

Usage:

import nltk

classifier = nltk.NaiveBayesClassifier.train(goods)
# show_most_informative_features() prints the table itself and returns None
classifier.show_most_informative_features(32)

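My broader goal: I want to see how well differential word frequencies predict the label, and which sets of words are most informative in separating the labels from one another. This has a k-means feel to it, but I have been told I should be able to do it entirely within NLTK. I am just having trouble getting my data into the appropriate input format.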

Answer 1

I have had success with the following approach:

import nltk

# Each training item is a (feature dict, label) tuple
train = [({'some': True, 'tokens': True}, 'label'),
         ({'other': True, 'word': True}, 'different label'),
         ({'cool': True, 'document': True}, 'label')]
classifier = nltk.NaiveBayesClassifier.train(train)

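So train is a list of documents, each represented as a tuple. The first element of each tuple is a dictionary of tokens (the tokens are the keys, and the value True indicates that the token is present), and the second element is the label associated with the document.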
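
To connect this back to the corpus in the question, here is a minimal sketch of the conversion, assuming the (text, label) tuple layout. The helper name bag_of_words and the plain whitespace tokenizer are assumptions made for illustration; a real tokenizer such as nltk.word_tokenize could be swapped in if its data files are installed.

```python
import nltk

# Labelled corpus in the (text, label) shape shown in the question.
corpus = [
    ("This is a sentence", "awesome"),
    ("This is another sentence", "not awesome"),
]

def bag_of_words(text):
    """Hypothetical helper: map each lowercased token to True."""
    # Whitespace splitting is an assumption that keeps the sketch dependency-free.
    return {token: True for token in text.lower().split()}

# Build the list of (feature dict, label) tuples that train() accepts.
train = [(bag_of_words(text), label) for text, label in corpus]
classifier = nltk.NaiveBayesClassifier.train(train)

# Prints the features that best separate the labels (returns None).
classifier.show_most_informative_features(5)

# Classify unseen text using the same feature extraction.
predicted = classifier.classify(bag_of_words("This is another example"))
```

For the dictionary layout, the same comprehension should work with item["text"] and item["label"] in place of the tuple unpacking. With several external labels, one option is to train a separate classifier per label field.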

How to use metadata in an NLTK classifier
