How to fix "sequence item 0: expected str instance, tuple found"

Asked: 2018-05-06 12:19:23

Tags: python nlp

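I'm trying to do some POS tagging with nltk (code below), and I run into the error in the title when I try to write the result to a new file. If I run #fout.write("\n".join(tagged)), it raises that error. If I instead run #fout.write(str.join(tagged)) to work around it, I get a different error: 'join' requires a 'str' object but received a 'list'.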

The text file is stored locally and is relatively large.

from pathlib import Path
from nltk.tokenize import word_tokenize as wt
import nltk
import pprint

output_dir = Path ("\\Path\\")
output_file = (output_dir / "Token2290newsML.txt")

news_dir = Path("\\Path\\")
news_file = (news_dir / "2290newsML.txt")

tagged_dir = Path("\\Path\\")
tagged_file = (tagged_dir / "tagged2290newsML.txt")

file = open(news_file, "r")
data = file.readlines()

f = open(tagged_file, "w")

def process_content():
    try:
        for i in data:
            words = wt(i)
            pprint.pprint(words)
            tagged = nltk.pos_tag(words)
            pprint.pprint(tagged)
            #f.write("\n".join(tagged))
            f.write(str.join(tagged))

    except Exception as e:
        print(str(e))

process_content()
file.close()

Any help would be greatly appreciated.

Thanks :)

1 answer:

Answer 0 (score: 1)

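nltk.pos_tag() returns a list of 2-tuples. The first element of each tuple is the word, and the second is the part-of-speech tag for that word. For example: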

>>> tagged = nltk.pos_tag('This is a test'.split())
>>> tagged
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')]
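
This is why the join in the question fails: str.join() accepts only strings, and every item in the list is a tuple. A minimal sketch of the failure and one way around it, using a hard-coded list in place of the nltk.pos_tag() result so that it runs without NLTK (the names error_message and lines are illustrative only):

```python
# Hard-coded sample mirroring the nltk.pos_tag() output shown above,
# so the snippet runs without NLTK installed.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')]

try:
    "\n".join(tagged)  # every item is a tuple, not a str
except TypeError as e:
    error_message = str(e)
    print(error_message)

# Joining each (word, tag) pair first produces strings,
# which the outer join then accepts.
lines = "\n".join(" ".join(pair) for pair in tagged)
print(lines)
```

The second attempt in the question, str.join(tagged), fails for a different reason: calling join on the str class itself passes the list where the separator string is expected, so it never gets as far as looking at the tuples.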

Assuming you want to write each word and its tag on a separate line:

with open(tagged_file, 'w') as f:
    for pair in tagged:
        print(' '.join(pair), file=f)

This creates a file with the following contents:

This DT
is VBZ
a DT
test NN
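
The question tags the input one line at a time, so the same idea would need to run inside that loop, with the output file opened once rather than once per line. A rough sketch under that assumption follows; write_tagged is a hypothetical helper, the blank line between sentences is an arbitrary choice, and an in-memory buffer with hard-coded pairs stands in for the real file and for the nltk.pos_tag() results:

```python
import io

def write_tagged(tagged_sentences, out):
    """Write one 'word tag' pair per line, with a blank line after each sentence."""
    for tagged in tagged_sentences:
        for word, tag in tagged:
            print(word, tag, file=out)
        print(file=out)  # blank line marks the end of a sentence

# Hard-coded stand-in for the per-line results of nltk.pos_tag(wt(line)).
sample = [
    [('This', 'DT'), ('is', 'VBZ')],
    [('a', 'DT'), ('test', 'NN')],
]

# io.StringIO stands in for the output file opened with open(tagged_file, 'w').
buffer = io.StringIO()
write_tagged(sample, buffer)
result = buffer.getvalue()
print(result)
```

With a real file, the call would sit inside a with open(tagged_file, 'w') as f: block, which also ensures the output file is closed when the loop finishes.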

You can change the file format as needed.
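
As one illustration of a different format, the pairs can be written as word/TAG tokens with a whole sentence on a single line. This is only a sketch with a hard-coded list; the slash separator is an assumption, not something the question asked for:

```python
# Hard-coded sample in the same shape as the nltk.pos_tag() output.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')]

# Format each pair as word/TAG, then join the tokens with spaces.
line = " ".join(f"{word}/{tag}" for word, tag in tagged)
print(line)
```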