How do I use the book functions (e.g. concordance) in NLTK?

Asked: 2013-07-18 21:47:38

Tags: python nlp nltk

I am working through the wonderful tutorial.

I downloaded a collection called book:

>>> import nltk
>>> nltk.download()

and imported the texts:

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811

I can then run commands on these texts:

>>> text1.concordance("monstrous")

How can I run these nltk commands on my own dataset? Are these collections the same kind of object as book in Python?

2 answers:

Answer 0: (score: 4)

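You're right, it is hard to find documentation for the book.py module, so we have to get our hands dirty and look at the code (see here). Judging from book.py, here is what you need to do to get concordance and all the other fancy features of the book module: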

First, you have to put your raw texts into one of NLTK's corpus classes; see Creating a new corpus with NLTK for details.

Second, you read the words of the corpus into NLTK's Text class. After that, you can use the functions shown in http://nltk.org/book/ch01.html:

import os

from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text

# For example, I create two example text files
text1 = '''
This is a story about a foo bar. Foo likes to go to the bar and his last name is also bar. At home, he kept a lot of gold chocolate bars.
'''
text2 = '''
One day, foo went to the bar in his neighborhood and was shot down by a sheep, a blah blah black sheep.
'''
# Creating the corpus
corpusdir = './mycorpus/'
os.makedirs(corpusdir, exist_ok=True)  # the directory must exist before writing
with open(corpusdir + 'text1.txt', 'w') as fout:
    fout.write(text1)
with open(corpusdir + 'text2.txt', 'w') as fout:
    fout.write(text2)

# Read the example corpus into NLTK's corpus class.
mycorpus = PlaintextCorpusReader(corpusdir, '.*')

# Read NLTK's corpus into NLTK's Text class,
# where the book-like concordance search is available
mytext = Text(mycorpus.words())

mytext.concordance('foo')

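Note: you can use NLTK's other CorpusReaders, and even specify custom paragraph/sentence/word tokenizers and an encoding, but for now we'll stick with the defaults.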
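As a rough illustration of the note above, here is a minimal sketch of passing a custom word tokenizer and an explicit encoding to PlaintextCorpusReader. The temporary directory, the file name sample.txt, the sample sentence, and the choice of a `\w+` RegexpTokenizer (which drops punctuation) are all assumptions made for this example, not part of the original answer.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text
from nltk.tokenize import RegexpTokenizer

# Hypothetical corpus: one small file in a temporary directory.
corpusdir = tempfile.mkdtemp()
with open(os.path.join(corpusdir, 'sample.txt'), 'w', encoding='utf-8') as fout:
    fout.write("Foo likes the bar. The bar likes Foo.")

# Custom word tokenizer: keep runs of word characters, discard punctuation.
word_tokenizer = RegexpTokenizer(r'\w+')

# Only read .txt files, and state the encoding explicitly.
mycorpus = PlaintextCorpusReader(
    corpusdir,
    r'.*\.txt',
    word_tokenizer=word_tokenizer,
    encoding='utf-8',
)

words = list(mycorpus.words())
mytext = Text(words)
mytext.concordance('bar')
```

With the default tokenizer the full stops would show up as separate tokens; with this one they are dropped, which may or may not be what you want for a concordance.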

Answer 1: (score: 2)

Use the Text Analysis with NLTK Cheatsheet from blogs.princeton.edu: https://blogs.princeton.edu/etc/files/2014/03/Text-Analysis-with-NLTK-Cheatsheet.pdf

To use your own texts:

Open a file for reading:

file = open('myfile.txt') 

Make sure you are in the correct directory before starting Python, or give the full path to the file.

Read the file:

t = file.read() 

Tokenize the text:

tokens = nltk.word_tokenize(t)

Convert it to an NLTK Text object:

text = nltk.Text(tokens)
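
Put together, the steps above might look like the sketch below. A few things here are assumptions for the sake of a self-contained example: the file is written to a temporary directory first so the script has something to read, and wordpunct_tokenize is swapped in for nltk.word_tokenize because it is purely regex-based, whereas word_tokenize needs the Punkt tokenizer data to be downloaded beforehand.

```python
import os
import tempfile

import nltk
from nltk.tokenize import wordpunct_tokenize

# Hypothetical input file, created here only so the example can run.
path = os.path.join(tempfile.mkdtemp(), 'myfile.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write("One day, foo went to the bar. The bar was closed, so foo went home.")

# Open and read the file; the with-block closes it afterwards.
with open(path, encoding='utf-8') as f:
    t = f.read()

# Tokenize (regex-based stand-in for nltk.word_tokenize).
tokens = wordpunct_tokenize(t)

# Convert to an NLTK Text object and use the book-style methods.
text = nltk.Text(tokens)
text.concordance('bar')
print(text.count('foo'))
```

Once you have the Text object, it behaves like text1 ... text9 from nltk.book, so methods such as concordance and count are available on your own data.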