Question

I'm struggling with this problem. I'm new to coding. I'm trying to read a .txt file, tokenize it, and POS-tag the words in it.

This is what I have so far:

import nltk
from nltk import word_tokenize
import re

file = open('1865-Lincoln.txt', 'r').readlines()
text = word_tokenize(file)
string = str(text)
nltk.pos_tag(string)

My problem is that it keeps giving me a TypeError: expected string or bytes-like object error.

Answer 1

word_tokenize expects a string, but file.readlines() gives you a list. Converting the list to a single string solves the problem.

import nltk
from nltk import word_tokenize

# word_tokenize expects a single string, so join the lines from readlines()
with open('test.txt', 'r') as f:
    lines = f.readlines()
text = ''.join(lines)
tokens = word_tokenize(text)
# Pass the token list itself to pos_tag; str(tokens) would be tagged character by character
tagged = nltk.pos_tag(tokens)
print(tagged)
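
To see the difference between readlines() and read() without involving NLTK, here is a minimal sketch using only the standard library. The file name sample.txt and its contents are made up for illustration, and re.findall is only a stand-in for the tokenizer: NLTK's tokenizers rely on regular expressions internally, which is likely where the same TypeError message comes from.

```python
import os
import re
import tempfile

# Hypothetical sample file, created only for this demonstration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("user is here\npass is there\n")

with open(path, encoding="utf-8") as f:
    lines = f.readlines()   # a list of strings, one per line

with open(path, encoding="utf-8") as f:
    text = f.read()         # one string holding the whole file

# A regex function rejects the list, much as word_tokenize does
try:
    re.findall(r"\w+", lines)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

words = re.findall(r"\w+", text)   # works on the single string
joined = "".join(lines)            # same content as read()

print(type(lines).__name__, type(text).__name__, error_name)
print(words)
```

Either f.read() or ''.join(lines) yields a string that can be handed to a tokenizer; which one to use is mostly a matter of taste.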

Answer 2

I'd suggest doing the following:

import nltk
# nltk.download('all') # only for the first time when you use nltk
from nltk import word_tokenize
import re

with open('1865-Lincoln.txt') as f:  # a with block closes the file automatically
    lines = f.readlines()  # read all lines from the file into a list
for line in lines:  # process one line at a time
    token_text = word_tokenize(line)  # tokenize the line
    print(token_text)  # for debugging
    pos_tagged_token = nltk.pos_tag(token_text)  # tag the tokens of this line
    print(pos_tagged_token)

For a text file containing the following

user is here

pass is there

the output is:

['user', 'is', 'here']

[('user', 'NN'), ('is', 'VBZ'), ('here', 'RB')]

['pass', 'is', 'there']

[('pass', 'NN'), ('is', 'VBZ'), ('there', 'RB')]

It works for me; I'm using Python 3.6, if that matters. Hope this helps!

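Edit 1: So your problem is that you are passing a list of strings (the result of readlines()) to word_tokenize(), which expects a single string. pos_tag(), in turn, takes the resulting tokens; as the doc says

A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word

So you need to tokenize the text one line at a time, i.e. one string at a time, and pass the resulting tokens to pos_tag(). That is why you get the TypeError: expected string or bytes-like object error.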
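
The effect of wrapping the token list in str() before tagging, as the code in the question does, can be illustrated without NLTK. In this sketch, toy_tag is a hypothetical stand-in that pairs every item of a sequence with a dummy label. It is not NLTK's pos_tag, but it walks over its input item by item in the same way.

```python
def toy_tag(items):
    """Hypothetical stand-in for a tagger: label every item of a sequence."""
    return [(item, "X") for item in items]

tokens = ["user", "is", "here"]

by_word = toy_tag(tokens)        # one pair per word
by_char = toy_tag(str(tokens))   # str(tokens) is the text "['user', 'is', 'here']"

print(by_word)
print(len(by_char), by_char[:3])
```

Iterating over a string yields single characters, so the second call labels brackets, quotes and letters rather than words. Passing the token list itself avoids this.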

Answer 3

Most likely 1865-Lincoln.txt refers to President Lincoln's inaugural address. It is available in NLTK from https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/inaugural.zip

The original source of the document is the Inaugural Address Corpus.

If we check how NLTK reads the file using LazyCorpusLoader, we see that the file is Latin-1 encoded.

inaugural = LazyCorpusLoader(
    'inaugural', PlaintextCorpusReader, r'(?!\.).*\.txt', encoding='latin1')

If your default encoding is set to utf8, that is most likely where the TypeError: expected string or bytes-like object is occurring.

You should open the file with an explicit encoding so that the strings are decoded properly, i.e.

import nltk
from nltk import word_tokenize, pos_tag

tagged_lines = []
with open('test.txt', encoding='latin1') as fin:
    for line in fin:
        tagged_lines.append(pos_tag(word_tokenize(line)))
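
The role of the explicit encoding can be sketched with the standard library alone. The file name and its contents below are invented for the demonstration: the file is written as Latin-1 with one non-ASCII character, then read back once as UTF-8 and once as Latin-1. In this sketch the mismatch surfaces as a UnicodeDecodeError at read time.

```python
import os
import tempfile

# Hypothetical file, written as Latin-1 with one non-ASCII character (e-acute)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "latin1_sample.txt")
with open(path, "w", encoding="latin1") as f:
    f.write("A caf\u00e9 in Washington\n")

# Reading with the wrong codec fails: the byte 0xE9 followed by a space
# is not a valid UTF-8 sequence
try:
    with open(path, encoding="utf8") as f:
        f.read()
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = type(exc).__name__

# Reading with the matching codec returns the original text
with open(path, encoding="latin1") as f:
    text = f.read()

print(utf8_error)
print(text.strip())
```

Naming the encoding in open() keeps the result independent of the platform's default, which differs between systems.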

But technically, you can access the inaugural corpus directly as a corpus object in NLTK, i.e.

>>> from nltk.corpus import inaugural
>>> from nltk import pos_tag
>>> tagged_sents = [pos_tag(sent) for sent in inaugural.sents('1865-Lincoln.txt')]

Tagging a .txt file from the Inaugural Address Corpus
