I'm testing NLTK and trying to train a PunktSentenceTokenizer on the G. W. Bush 2005 and 2006 State of the Union speeches, but I get a "LazyCorpusLoader is not callable" error.
Code:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union("2005-GWBush.txt")
sample_text = state_union("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print tagged
    except Exception as e:
        print(str(e))

process_content()
Answer 0 (score: 0)
Figured it out. Since I'm using Python 2.7 it behaves a bit differently. I call .words() manually and then encode the words as ASCII or UTF-8, because the text was scraped from a website and usually comes in Unicode, so it needs to be encoded.
Here's the required code snippet.
train_text = nltk.corpus.state_union.words("2005-GWBush.txt")
sample_text = nltk.corpus.state_union.words("2006-GWBush.txt")

for word in train_text:
    train_text = word.encode("ascii")
for word in sample_text:
    sample_text = word.encode("ascii")
Answer 1 (score: 0)
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

nltk.download("state_union")
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
Try this... use .raw to get the text data from state_union.
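The distinction matters because PunktSentenceTokenizer trains on a raw string, which is what .raw() returns (.words() returns a word list, and calling state_union(...) directly triggers the "LazyCorpusLoader is not callable" error). A minimal sketch, assuming only that NLTK is installed, with a small stand-in training string in place of the real corpus:

```python
from nltk.tokenize import PunktSentenceTokenizer

# Stand-in training text (hypothetical; the answer trains on state_union.raw(...)).
train_text = "This is one sentence. Here is another. And a third one follows."

# The constructor expects a raw string, as .raw() provides.
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
sentences = custom_sent_tokenizer.tokenize("Hello there. How are you today?")
print(sentences)
```

This runs without any nltk.download() call, since Punkt is trained from scratch on the string passed in.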