I'm testing NLTK and trying to train a PunktSentenceTokenizer on the G. W. Bush 2005 and 2006 State of the Union speeches, but I get a "LazyCorpusLoader is not callable" error.
Code:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union("2005-GWBush.txt")
sample_text = state_union("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print tagged
    except Exception as e:
        print(str(e))

process_content()
Answer 0 (score: 0)
Figured it out. Since I'm using Python 2.7 it behaves a bit differently. I call .words() manually and then encode the words as ASCII or UTF-8, because the text was scraped from a website and usually comes in Unicode, so it needs to be encoded.
Here's the required code snippet.
train_text = nltk.corpus.state_union.words("2005-GWBush.txt")
sample_text = nltk.corpus.state_union.words("2006-GWBush.txt")

for word in train_text:
    train_text = word.encode("ascii")
for word in sample_text:
    sample_text = word.encode("ascii")
Answer 1 (score: 0)
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

nltk.download("state_union")
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
Try this... use .raw to get the text data from state_union.
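The distinction matters because PunktSentenceTokenizer trains on a raw string, which is what .raw() returns (.words() returns a word list, and calling state_union(...) directly triggers the "LazyCorpusLoader is not callable" error). A minimal sketch, assuming only that NLTK is installed, with a small stand-in training string in place of the real corpus:

```python
from nltk.tokenize import PunktSentenceTokenizer

# Stand-in training text (hypothetical; the answer trains on state_union.raw(...)).
train_text = "This is one sentence. Here is another. And a third one follows."

# The constructor expects a raw string, as .raw() provides.
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
sentences = custom_sent_tokenizer.tokenize("Hello there. How are you today?")
print(sentences)
```

This runs without any nltk.download() call, since Punkt is trained from scratch on the string passed in.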