Question

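I've started using NLTK & Python, but I'm really confused by the NLTK corpus structure. For example:

I can't understand why we need to append words twice to the nltk.corpus module:

wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
Also, the types of nltk.corpus.words and nltk.corpus.words.words are different. Why is that?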

type(nltk.corpus)
nltk.corpus
type(nltk.corpus.words)
nltk.corpus.words
type(nltk.corpus.words.words)
nltk.corpus.words.words
C:\\Documents and Settings\\Administrator\\nltk_data\\corpora\\words'>>
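Third, how is one supposed to know that words has to be appended twice to nltk.corpus to produce a word list? In other words, what is the difference between calling nltk.corpus.words and nltk.corpus.words.words?

Could someone elaborate? Right now it's hard to make progress through the third chapter of the NLTK book.

Thanks a lot.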

Answer 1

It's quite simple: words is the name of a class instance that lives in nltk.corpus. The relevant code:

words = LazyCorpusLoader('words', WordListCorpusReader, r'(?!README|\.).*')

All this says is that words is an instance of LazyCorpusLoader.

So nltk.corpus.words gives you a reference to that instance.
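As a rough illustration of that chain, the sketch below builds a stand-in module with an attribute named words that is an instance, whose class has a method also named words (the method is covered further down). The names corpus_module and WordReader are hypothetical and only mimic the shape of nltk.corpus; nothing here is NLTK code, and the type names shown assume Python 3.

```python
import types


class WordReader(object):
    """Hypothetical stand-in for a corpus reader class (not NLTK's)."""

    def words(self):
        # A method that happens to share its name with the module attribute
        return ["apple", "Banana", "cherry"]


# Hypothetical stand-in for the nltk.corpus module
corpus_module = types.ModuleType("corpus_module")
# The module attribute 'words' is an instance, like nltk.corpus.words
corpus_module.words = WordReader()

module_type = type(corpus_module).__name__               # 'module'
instance_type = type(corpus_module.words).__name__       # 'WordReader'
method_type = type(corpus_module.words.words).__name__   # 'method'

# Only calling the method produces the actual list of words
wordlist = [w for w in corpus_module.words.words() if w.islower()]
print(module_type, instance_type, method_type, wordlist)
```

The first words selects the object stored in the module; the second words selects a method on that object, and the parentheses call it.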

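But wait!

If you look at the code for LazyCorpusLoader, you'll see that it goes on to instantiate WordListCorpusReader, the reader class it was given.

WordListCorpusReader happens to have a method named words, which looks like this: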

def words(self, fileids=None):
    return line_tokenize(self.raw(fileids))
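To make the shape of that method concrete, here is a minimal sketch of a reader with the same two methods. TinyWordListReader is a hypothetical class that keeps its text in memory, and the list comprehension is only an assumption about what line_tokenize does (one token per non-blank line); the real reader loads files from the corpus directory.

```python
class TinyWordListReader(object):
    """Hypothetical, simplified stand-in for WordListCorpusReader."""

    def __init__(self, raw_text):
        self._raw_text = raw_text

    def raw(self, fileids=None):
        # The real reader would read corpus files; here the text is in memory.
        return self._raw_text

    def words(self, fileids=None):
        # Assumed behavior of line_tokenize: one token per non-blank line.
        return [line for line in self.raw(fileids).splitlines() if line.strip()]


reader = TinyWordListReader("apple\nBanana\n\ncherry\n")
all_words = reader.words()
lowercase_words = [w for w in all_words if w.islower()]
print(all_words)
print(lowercase_words)
```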

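And LazyCorpusLoader does this:

corpus = self.__reader_cls(root, *self.__args, **self.__kwargs)

In effect, this calls self.__reader_cls, which here is WordListCorpusReader, so corpus becomes an instance of WordListCorpusReader (which has its own words method).

Then it does this: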

self.__dict__ = corpus.__dict__ 
self.__class__ = corpus.__class__

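The Python documentation describes __dict__ as "the module's namespace as a dictionary object"; for an instance it likewise holds the instance's attributes, so this assignment replaces the loader's namespace with that of corpus. Similarly, the documentation says of __class__ that "__class__ is the instance's class", so the second assignment changes the class as well. As a result, nltk.corpus.words.words refers to the instance method words of the instance named words. Does that make sense? This code illustrates the same behavior: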

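class Bar(object):
    def foo(self):
        return "I am a method of Bar"

class Foo(object):
    def __init__(self, newcls):
        # Instantiate the class that was passed in
        newcls = newcls()
        # Take over that instance's class and attributes
        self.__class__ = newcls.__class__
        self.__dict__ = newcls.__dict__

foo = Foo(Bar)
# foo is now a Bar, so Bar's method is available on it
print(foo.foo())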
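The same trick can be pushed one step further to show the "lazy" part: the swap happens only when an attribute is first requested. This is a sketch under stated assumptions, not NLTK's implementation; LazyLoader and SimpleReader are hypothetical names, and the use of __getattr__ as the trigger is an illustrative choice.

```python
class SimpleReader(object):
    """Hypothetical reader; stands in for a corpus reader class."""

    def __init__(self, entries):
        self.entries = entries

    def words(self):
        return list(self.entries)


class LazyLoader(object):
    """Hypothetical proxy that turns into the real reader on first use."""

    def __init__(self, reader_cls, *args):
        self._reader_cls = reader_cls
        self._args = args

    def _load(self):
        corpus = self._reader_cls(*self._args)
        # Adopt the reader's attributes and class, as in the example above
        self.__dict__ = corpus.__dict__
        self.__class__ = corpus.__class__

    def __getattr__(self, name):
        # Python calls this only when normal lookup fails, which here
        # means the reader has not been loaded yet.
        self._load()
        return getattr(self, name)


words = LazyLoader(SimpleReader, ["apple", "Banana", "cherry"])
type_before = type(words).__name__   # still the loader
wordlist = [w for w in words.words() if w.islower()]
type_after = type(words).__name__    # now the reader
print(type_before, type_after, wordlist)
```

Before the first call, words is a LazyLoader; after it, the very same object reports SimpleReader as its class, which mirrors why inspecting nltk.corpus.words shows a reader rather than a loader once the corpus has been touched.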

Here are links to the source code as well, so you can see for yourself:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus-pysrc.html

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordlist-pysrc.html#WordListCorpusReader
