This is the code I am trying, but it raises an error.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))
file_content = open("Dictionary.txt").read()
tokens = nltk.word_tokenize(file_content)
# sent_tokenize is one of instances of
# PunktSentenceTokenizer from the nltk.tokenize.punkt module
tokenized = sent_tokenize(tokens)
for i in tokenized:
    # Word tokenizer is used to find the words
    # and punctuation in a string
    wordsList = nltk.word_tokenize(i)
    # removing stop words from wordsList
    wordsList = [w for w in wordsList if not w in stop_words]
    # Using a Tagger, which is a part-of-speech
    # tagger or POS-tagger.
    tagged = nltk.pos_tag(wordsList)
    print(tagged)
Error:
Traceback (most recent call last):
  File "tag.py", line 12, in <module>
    tokenized = sent_tokenize(tokens)
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 105, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1269, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1323, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1323, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1313, in span_tokenize
    for sl in slices:
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1354, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 317, in _pair_iter
    prev = next(it)
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1327, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object
Answer 0 (score: 0)
I don't know what your code is supposed to do, but the error you are getting is caused by the data type of the tokens variable. sent_tokenize expects a string, but it is receiving a list.
You should change that line to:
tokens = str(nltk.word_tokenize(file_content))
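That makes the TypeError go away by turning the token list into one long string, though sent_tokenize will then split the printed representation of that list rather than the original text. As a minimal sketch of an alternative, assuming the goal is to POS-tag each sentence of Dictionary.txt with stop words removed, you could pass the raw file contents straight to sent_tokenize and only word-tokenize inside the loop:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time downloads that may be needed:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('averaged_perceptron_tagger')

stop_words = set(stopwords.words('english'))

# Read the raw text; sent_tokenize wants a plain string, not a token list
with open("Dictionary.txt") as f:
    file_content = f.read()

for sentence in sent_tokenize(file_content):
    # Word-tokenize each sentence, then drop stop words
    words = [w for w in word_tokenize(sentence) if w not in stop_words]
    # POS-tag the remaining words
    print(nltk.pos_tag(words))

Here word_tokenize runs on one sentence at a time instead of the whole file, which is the usual way the sent_tokenize/word_tokenize pair is combined.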