Question

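I'm learning a lot about natural language processing with nltk and can already do quite a bit, but I can't find a way to read the texts that ship with the package. I tried something like this:

[code snippet not preserved in the source]

But it doesn't seem to work, because the object has no fileid. Nobody seems to have asked this before, so I assume the answer is simple. Do you know how to read these texts, or how to convert them into a string? Thanks in advance.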

Answer 1

Let's dig into the code =)

First, the nltk.book code lives at https://github.com/nltk/nltk/blob/develop/nltk/book.py

If we look closely, the texts are loaded as nltk.Text objects, for example text6 at https://github.com/nltk/nltk/blob/develop/nltk/book.py#L36:

text6 = Text(webtext.words('grail.txt'), name="Monty Python and the Holy Grail")

The Text object is defined at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L286, and you can read more about how to use it at http://www.nltk.org/book/ch02.html

webtext is a corpus from nltk.corpus, so to get the raw text behind nltk.book.text6 you can load webtext directly, e.g.

>>> from nltk.corpus import webtext
>>> webtext.raw('grail.txt')

fileids are only available on a PlaintextCorpusReader object, not on a Text object (which is an already-processed object):

>>> type(webtext)
<class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>
>>> for filename in webtext.fileids():
...     print(filename)
... 
firefox.txt
grail.txt
overheard.txt
pirates.txt
singles.txt
wine.txt
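To make the reader-versus-Text distinction concrete, here is a minimal sketch. It assumes nltk is installed, and it builds a tiny throwaway corpus in a temporary directory so that no corpus download is needed. The helper name load_texts and the file demo.txt are made up for illustration; they are not part of nltk.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.text import Text


def load_texts(root):
    """Return ({fileid: raw string}, {fileid: Text}) for each .txt file under root.

    Hypothetical helper: the reader knows about fileids and raw text,
    while each Text only holds the tokens it was built from.
    """
    reader = PlaintextCorpusReader(root, r".*\.txt")
    raw = {fid: reader.raw(fid) for fid in reader.fileids()}
    texts = {fid: Text(reader.words(fid), name=fid) for fid in reader.fileids()}
    return raw, texts


# Demo on a made-up one-file corpus (assumption: any plain-text file works).
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "demo.txt"), "w", encoding="utf8") as fh:
        fh.write("Hello, world!")
    raw, texts = load_texts(root)
    demo_tokens = list(texts["demo.txt"].tokens)

print(raw["demo.txt"])
print(demo_tokens)
```

The same pattern should carry over to webtext once its data is downloaded: call raw() and fileids() on the reader, and reach for a Text only when you want the token-level helpers.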

Answer 2

It looks like the texts have already been broken up into tokens.

from nltk.book import text6

text6.tokens
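If what you want is a single string rather than a list of tokens, one rough option is to join the tokens back together. The sketch below is only an approximation: the helper name tokens_to_string is made up, the punctuation cleanup is a simple heuristic, and the original spacing cannot be recovered this way (for the exact original text, the corpus reader's raw() is the better route). It takes any iterable of string tokens, so it is demonstrated on a plain list to avoid needing the corpus data.

```python
import re


def tokens_to_string(tokens):
    """Join tokens with spaces, then drop the space before common punctuation.

    Hypothetical helper; a heuristic, not a faithful detokenizer.
    """
    joined = " ".join(tokens)
    return re.sub(r"\s+([.,!?;:])", r"\1", joined)


# Stand-in token list; a Text object's .tokens could be passed the same way.
sample = ["Hello", ",", "world", "!"]
print(tokens_to_string(sample))
```

With the book data downloaded, tokens_to_string(text6.tokens) would give one long string for the whole text.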

Answer 3

# generate the sorted set of unique tokens
from nltk.book import text6

print(sorted(set(text6)))

How do I read the nltk.text.Text files from nltk.book in Python?
