Question

This is the code I am trying, but it raises an error.

import nltk
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize, sent_tokenize 
stop_words = set(stopwords.words('english')) 

file_content = open("Dictionary.txt").read()
tokens = nltk.word_tokenize(file_content)

# sent_tokenize is one of instances of 
# PunktSentenceTokenizer from the nltk.tokenize.punkt module 

tokenized = sent_tokenize(tokens) 
for i in tokenized: 

    # Word tokenizers is used to find the words 
    # and punctuation in a string 
    wordsList = nltk.word_tokenize(i) 

    # removing stop words from wordList 
    wordsList = [w for w in wordsList if not w in stop_words] 

    # Using a Tagger. Which is part-of-speech 
    # tagger or POS-tagger. 
    tagged = nltk.pos_tag(wordsList) 

    print(tagged)

Error:

Traceback (most recent call last):
  File "tag.py", line 12, in <module>
    tokenized = sent_tokenize(tokens)
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 105, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1269, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1323, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1323, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1313, in span_tokenize
    for sl in slices:
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1354, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 317, in _pair_iter
    prev = next(it)
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1327, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object

Answer 1

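I'm not sure what your code is supposed to do, but the error comes from the data type of the tokens variable. sent_tokenize expects a string, but tokens is a list: nltk.word_tokenize returns a list of strings.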
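The same TypeError can be reproduced with the standard library alone, which may make the cause easier to see. In this minimal sketch the pattern is only a stand-in (an assumption, not the regular expression Punkt actually uses); the point is that finditer accepts a string and rejects a list.

```python
import re

# Stand-in pattern; the real Punkt pattern is more involved (assumption).
pattern = re.compile(r"\S+")

# word_tokenize returns a list of strings shaped like this one.
tokens = ["Hello", "world", "."]

try:
    # A list is not valid input for a regex search, so this raises TypeError.
    list(pattern.finditer(tokens))
except TypeError as exc:
    message = str(exc)
    print("TypeError:", message)

# The same call works once it receives a plain string.
matches = [m.group() for m in pattern.finditer("Hello world .")]
print(matches)
```

The last frame of the traceback makes the same kind of call, finditer(text), which is why a list passed to sent_tokenize fails there.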

You should change that line to:

tokens = str(nltk.word_tokenize(file_content))
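One thing to keep in mind with the line above: str() applied to a list returns the list's printed form, brackets and quotes included, so the sentence tokenizer would then see that text and not the original file content. A possible alternative, sketched here, is to pass the raw file text straight to sent_tokenize and tokenize words per sentence afterwards. The helper names tag_text and format_tagged are hypothetical, and the sketch assumes NLTK and the tokenizer, tagger, and stopword data used in the question are already installed.

```python
# str() on a token list yields the list's repr, not the original text.
as_text = str(["Hello", "world"])   # "['Hello', 'world']"


def tag_text(text, language="english"):
    """Return one list of (word, tag) pairs per sentence in `text`.

    Hypothetical helper: assumes NLTK and its data files are installed.
    """
    # Imported inside the function so this sketch loads without NLTK.
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import sent_tokenize, word_tokenize

    stop_words = set(stopwords.words(language))
    tagged_sentences = []
    # sent_tokenize receives the raw string, which is what it expects.
    for sentence in sent_tokenize(text):
        words = [w for w in word_tokenize(sentence) if w not in stop_words]
        tagged_sentences.append(nltk.pos_tag(words))
    return tagged_sentences


def format_tagged(tagged_sentences):
    """Render each sentence as 'word/TAG word/TAG', one sentence per line."""
    return "\n".join(
        " ".join(f"{word}/{tag}" for word, tag in sentence)
        for sentence in tagged_sentences
    )


# Possible usage with the file from the question:
#     with open("Dictionary.txt") as f:
#         print(format_tagged(tag_text(f.read())))
```

Writing the result of format_tagged to a file instead of printing it would give a tagged copy of the input text.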

How can I get a file with POS tags as output when I give a text file as input?
