Question

I'm trying to create a category under a parent category. Is that possible? If so, how is it done, and how do I refer to these subcategories?

Answer 1

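CategorizedCorpusReader supports only one level of categories. But since the categories are based on the file names, you are free to set up your own naming/category scheme and filter the corpus files as needed.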
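As a rough sketch of one such naming scheme: the file names below follow a hypothetical movie_<parent>_<child>.txt convention, so each category name carries both levels and subcategories can be picked out by prefix. The file names, the sample sentences, and the subcategories() helper are made up for illustration; only CategorizedPlaintextCorpusReader, categories(), and fileids() come from NLTK.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical corpus: file names encode parent and child category.
# The sentences are placeholder text, not real corpus data.
samples = {
    "movie_pos_drama.txt": "a moving and thoughtful film .",
    "movie_pos_comedy.txt": "a funny and warm film .",
    "movie_neg_drama.txt": "a slow and lifeless film .",
}

root = tempfile.mkdtemp()
for name, text in samples.items():
    with open(os.path.join(root, name), "w", encoding="utf-8") as f:
        f.write(text + "\n")

# The single capture group becomes the (flat) category name,
# e.g. "pos_drama" for movie_pos_drama.txt.
reader = CategorizedPlaintextCorpusReader(
    root, r"movie_.*\.txt", cat_pattern=r"movie_(\w+)\.txt"
)


def subcategories(reader, parent):
    """Return the flat category names that start with '<parent>_'."""
    prefix = parent + "_"
    return [c for c in reader.categories() if c.startswith(prefix)]


pos_subcats = subcategories(reader, "pos")
pos_files = reader.fileids(categories=pos_subcats)

print(reader.categories())
print(pos_subcats)
print(pos_files)
```

The reader itself still sees a single level; the hierarchy lives only in the naming convention and in the filtering you do on top of categories().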

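How do you want to use multi-level categories? If you have a follow-up question, please explain what you're hoping to achieve and what you have tried so far.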

Answer 2

The easiest way to categorize a corpus is to have one file for each category. Following are two excerpts from the movie_reviews corpus:

movie_pos.txt

the thin red line is flawed but it provokes .

movie_neg.txt

a big-budget and glossy production can not make up for a lack of spontaneity that permeates their tv show .

With these two files, we'll have two categories: pos and neg.

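We'll use CategorizedPlaintextCorpusReader, which inherits from both PlaintextCorpusReader and CategorizedCorpusReader. Between them, these two superclasses need three arguments: the root directory, the fileids, and a category specification.

>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> reader = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')
>>> reader.categories()
['neg', 'pos']
>>> reader.fileids(categories=['neg'])
['movie_neg.txt']
>>> reader.fileids(categories=['pos'])
['movie_pos.txt']

The first two arguments to CategorizedPlaintextCorpusReader are the root directory and the fileids, which are passed on to PlaintextCorpusReader to read in the files.

The cat_pattern keyword argument is a regular expression for extracting the category names from the fileids. In our case, the category is the part of the fileid after movie_ and before .txt. The category must be surrounded by grouping parentheses.

cat_pattern is passed to CategorizedCorpusReader, which overrides the common corpus reader functions such as fileids(), words(), sents(), and paras() to accept a categories keyword argument. This way, you can get all the pos sentences by calling reader.sents(categories=['pos']). CategorizedCorpusReader also provides the categories() function, which returns a list of all known categories in the corpus.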
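The category specification does not have to be a regular expression. As a sketch of how a parent/child relationship could be expressed, the example below uses the cat_map keyword of CategorizedCorpusReader (a dict mapping each fileid to a list of category names) and lists every file under both a parent and a child category. The "pos/war" and "neg/tv" labels are a made-up naming convention, and the temporary directory stands in for wherever your corpus files live.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# The two excerpts quoted above, written to a temporary directory.
texts = {
    "movie_pos.txt": "the thin red line is flawed but it provokes .",
    "movie_neg.txt": (
        "a big-budget and glossy production can not make up for a lack "
        "of spontaneity that permeates their tv show ."
    ),
}

root = tempfile.mkdtemp()
for name, text in texts.items():
    with open(os.path.join(root, name), "w", encoding="utf-8") as f:
        f.write(text + "\n")

# Hypothetical labels: each file belongs to a parent category and to a
# "parent/child" category. NLTK treats both as plain, unrelated strings.
cat_map = {
    "movie_pos.txt": ["pos", "pos/war"],
    "movie_neg.txt": ["neg", "neg/tv"],
}

reader = CategorizedPlaintextCorpusReader(root, r"movie_.*\.txt", cat_map=cat_map)

all_categories = reader.categories()
war_files = reader.fileids(categories=["pos/war"])
pos_file_categories = reader.categories(fileids=["movie_pos.txt"])
pos_words = list(reader.words(categories=["pos"]))

print(all_categories)
print(war_files)
print(pos_file_categories)
print(pos_words[:4])
```

Because a file can appear under several categories, asking for "pos" returns everything in the parent, while "pos/war" narrows it to the child. The slash is only a convention here; nothing in the reader interprets it.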

How to create subcategories for a corpus in NLTK Python
