Extracting nouns using POS tagging with Python (loop)

Time: 2017-09-22 14:41:40

Tags: python text tagging

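I want to extract nouns or noun groups from a huge text file. The Python code below runs fine, but it only extracts the nouns from the last line. I'm pretty sure the code needs an 'append', but I don't know how to do it (I'm a beginner in Python).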

import nltk
from nltk import pos_tag
import nltk.tokenize 
import numpy

f = open(r'infile.txt', encoding="utf8")
data = f.readlines()

tagged_list = []

for line in data:
    tokens = nltk.word_tokenize(line)
    tagged = nltk.pos_tag(tokens)
    nouns = [word for word,pos in tagged \
            if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS')]
    downcased = [x.lower() for x in nouns]
    joined = " ".join(downcased).encode('utf-8')
    into_string = str(nouns)

output = open(r"outfile.csv", "wb")
output.write(joined)
output.close()

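The result looks like this: apartment transportation downtown, which are the nouns from the last line of the file. I want to save the nouns from each line of the file on its own line. For example, the input file and the corresponding result should look like this: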

Input file:
I like the milk.
I like the milk and bread.
I like the milk, bread, and butter.

Output file:
milk
milk bread
milk bread butter

I hope someone can help fix the code above.

1 answer:

Answer 0: (score: 2)

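Accumulate each line's result at the end of the for loop body, then write the accumulated result to the file after the loop.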

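...
result = ""
for line in data:
    ...
    # Assumes `joined` is a str, i.e. the .encode('utf-8') call is dropped;
    # the newline keeps each input line's nouns on their own output line.
    result += joined + "\n"

with open(r"outfile.csv", "w", encoding="utf8") as output:
    output.write(result)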
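
To see why the loop in the question keeps only the last line, here is a minimal sketch that runs without NLTK. The sample lists are made up for illustration and stand in for the per-line noun lists the tagger would produce.

```python
# Hypothetical per-line noun lists standing in for the tagger output.
lines_of_nouns = [["milk"], ["milk", "bread"], ["milk", "bread", "butter"]]

# Pattern from the question: `joined` is rebound on every pass,
# so after the loop it only holds the value for the last line.
for nouns in lines_of_nouns:
    joined = " ".join(nouns)
last_only = joined

# Accumulating pattern: add to `result` inside the loop,
# with a newline so every input line gets its own output line.
result = ""
for nouns in lines_of_nouns:
    joined = " ".join(nouns)
    result += joined + "\n"

print(repr(last_only))
print(repr(result))
```

The only difference between the two loops is where the value is stored: a variable that is overwritten versus one that grows on each pass.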

If you want to use append:

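...
result_list = []
for line in data:
    ...
    # Assumes `joined` is a str (no .encode('utf-8')).
    result_list.append(joined)

with open(r"outfile.csv", "w", encoding="utf8") as output:
    # Join with newlines; str(result_list) would write the list's Python repr.
    output.write("\n".join(result_list) + "\n")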
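
When writing a list to a file, note that calling str() on the list produces its Python representation (brackets and quotes included) rather than one item per line. A small sketch of the difference, using the sample values from the question:

```python
# Sample values taken from the expected output in the question.
result_list = ["milk", "milk bread", "milk bread butter"]

as_repr = str(result_list)         # Python list literal, brackets and quotes included
as_lines = "\n".join(result_list)  # one entry per line, no brackets

print(as_repr)
print(as_lines)
```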

Also, if you use the result list, you can write it out like this:

...
with open(r"outfile.csv", "w", encoding="utf8") as output:
    for item in result_list:
        # One entry per line; assumes each item is a str.
        output.write(str(item) + "\n")
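
Putting the pieces together, the following is one possible end-to-end sketch, not a definitive implementation. The names `NOUN_TAGS`, `nltk_tag_line`, `nouns_per_line`, and `extract_nouns` are hypothetical helpers introduced here for illustration. It assumes NLTK is installed along with the tokenizer and tagger data it needs, and that the output should be plain text with one line per input line.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # the four noun tags checked in the question


def nltk_tag_line(line):
    """Tokenize and POS-tag one line with NLTK (requires its data to be downloaded)."""
    import nltk  # imported lazily so the helpers below can be tried without NLTK
    return nltk.pos_tag(nltk.word_tokenize(line))


def nouns_per_line(lines, tag_line=nltk_tag_line):
    """Return one space-joined, lowercased string of nouns per input line."""
    results = []
    for line in lines:
        tagged = tag_line(line)
        nouns = [word.lower() for word, pos in tagged if pos in NOUN_TAGS]
        # Append inside the loop so no line's result is overwritten.
        results.append(" ".join(nouns))
    return results


def extract_nouns(in_path, out_path, tag_line=nltk_tag_line):
    """Read in_path line by line and write each line's nouns to out_path."""
    with open(in_path, encoding="utf8") as infile:
        results = nouns_per_line(infile, tag_line)
    with open(out_path, "w", encoding="utf8") as outfile:
        outfile.writelines(item + "\n" for item in results)
```

With the file names from the question, the call would be `extract_nouns("infile.txt", "outfile.csv")`. Passing the tagger as a parameter is a design choice made for this sketch: it lets the loop logic be checked with a stand-in tagger before running NLTK on a huge file.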