Python tokenization and POS tagging of a CSV file

Time: 2017-09-01 19:24:08

Tags: python csv nlp pos-tagger

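I am new to Python and want to do POS tagging after importing a CSV file from my local machine. I looked up some resources online and found that the following code works.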

from nltk import tokenize
from nltk.tag import pos_tag
from nltk.tokenize import TreebankWordTokenizer

# The string is wrapped in parentheses so it can span several lines.
text = ('Senator Elizabeth Warren from Massachusetts announced her support of '
        'Social Security in Washington, D.C. on Tuesday. Warren joined other '
        'Democrats in support.')

# Split the text into sentences.
sentences = tokenize.sent_tokenize(text)

# Tokenize each sentence into words.
tokenizer = TreebankWordTokenizer()
texttokens = []
for sent in sentences:
    texttokens.append(tokenizer.tokenize(sent))

# POS-tag each tokenized sentence.
taggedsentences = []
for sentencetokens in texttokens:
    taggedsentences.append(pos_tag(sentencetokens))

print(taggedsentences)

When I print it, the result of the code above looks like this.

[[('Senator', 'NNP'), ('Elizabeth', 'NNP'), ('Warren', 'NNP'), ('from', 
'IN'), ('Massachusetts', 'NNP'), ('announced', 'VBD'), ('her', 'PRP$'), 
('support', 'NN'), ('of', 'IN'), ('Social', 'NNP'), ('Security', 'NNP'), 
('in', 'IN'), ('Washington', 'NNP'), (',', ','), ('D.C.', 'NNP'), ('on', 
'IN'), ('Tuesday', 'NNP'), ('.', '.')], [('Warren', 'NNP'), ('joined', 
'VBD'), ('other', 'JJ'), ('Democrats', 'NNPS'), ('in', 'IN'), ('support', 
'NN'), ('.', '.')]]

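This is the result I want, but I would like to get it after importing a CSV file that contains multiple rows (each row holds several sentences). For example, the CSV file looks like this: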

---------------------------------------------------------------
I like this product. This product is beautiful. I love it. 
---------------------------------------------------------------
This product is awesome. It have many convenient features.
---------------------------------------------------------------
I went this restaurant three days ago. The food is too bad.
---------------------------------------------------------------
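Reading a file like this and tagging each row could be sketched as follows. This is only a rough sketch under some assumptions: the text sits in the first column, there is no header row, and the text contains no unquoted commas (otherwise csv.reader would split it into several cells). The names nltk_tag_text, tag_csv_rows and the file name reviews.csv are made up for illustration, and the NLTK tokenizer and tagger data are assumed to be downloaded already.

```python
import csv


def nltk_tag_text(text):
    """Split text into sentences, tokenize each one, and POS-tag it with NLTK."""
    # Imported here so the CSV logic below can be used without NLTK installed.
    from nltk import pos_tag
    from nltk.tokenize import TreebankWordTokenizer, sent_tokenize

    tokenizer = TreebankWordTokenizer()
    return [pos_tag(tokenizer.tokenize(sent)) for sent in sent_tokenize(text)]


def tag_csv_rows(csv_file, tag_text=nltk_tag_text):
    """Return one list of tagged sentences per non-empty CSV row.

    Assumes the text is in the first column and there is no header row.
    """
    tagged_rows = []
    for row in csv.reader(csv_file):
        # csv.reader yields an empty list for a blank line; skip those.
        if row and row[0].strip():
            tagged_rows.append(tag_text(row[0]))
    return tagged_rows


# Usage (the file name is hypothetical):
# with open("reviews.csv", newline="", encoding="utf-8") as f:
#     tagged_rows = tag_csv_rows(f)
```

The tagger is passed in as a parameter so the row handling can be tried out with a stand-in function before NLTK is involved.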

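Finally, I want to save the POS-tagged result shown above after importing the CSV file. I want to write the POS-tagged sentences from each row to a CSV file.

There are two possible formats. The first might look like this (no header, one POS-tagged sentence per row).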

----------------------------------------------------------------------------
[[('I', 'PRON'), ('like', 'VBD'), ('this', 'PRON'), ('product', 'NN')]]
----------------------------------------------------------------------------
[[('This', 'PRON'), ('product', 'NN'), ('is', 'VERB'), ('beautiful', 'ADJ')]]
---------------------------------------------------------------------------
[[('I', 'PRON'), ('love', 'VERB'), ('it', 'PRON')]]
----------------------------------------------------------------------------
...

The second format might look like this (no header, each token and POS tag pair stored in its own cell):

----------------------------------------------------------------------------
('I', 'PRON')    | ('like', 'VBD')   | ('this', 'PRON') | ('product', 'NN')
----------------------------------------------------------------------------
('This', 'PRON') | ('product', 'NN') | ('is', 'VERB')   | ('beautiful', 'ADJ')
---------------------------------------------------------------------------
('I', 'PRON')    | ('love', 'VERB')  | ('it', 'PRON')   |
----------------------------------------------------------------------------
...

I prefer the second format to the first.
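For what it is worth, the second format could probably be produced with the standard csv module along these lines. This is a sketch, not a tested solution for the real data: write_tagged_sentences and tagged_rows_to_csv_text are hypothetical names, and the input is assumed to be a list with one entry per CSV row, each entry being a list of tagged sentences like the nested list printed earlier.

```python
import csv
import io


def write_tagged_sentences(tagged_rows, out_file):
    """Write one CSV row per sentence, with one "(token, tag)" cell per token."""
    writer = csv.writer(out_file)
    for tagged_sentences in tagged_rows:
        for sentence in tagged_sentences:
            # str(("I", "PRON")) gives "('I', 'PRON')"; csv.writer quotes the
            # cell automatically because it contains a comma.
            writer.writerow([str(pair) for pair in sentence])


def tagged_rows_to_csv_text(tagged_rows):
    """Return the CSV output as a string, handy for checking it in memory."""
    buffer = io.StringIO(newline="")
    write_tagged_sentences(tagged_rows, buffer)
    return buffer.getvalue()


# Writing to disk (the file name is hypothetical):
# with open("tagged.csv", "w", newline="", encoding="utf-8") as f:
#     write_tagged_sentences(tagged_rows, f)
```

Sentences of different lengths simply yield rows with different numbers of cells, which matches the ragged layout in the example.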

The Python code I wrote here works perfectly, but I want to do the same thing for a CSV file and finally save the result on my local machine.

The ultimate goal is to extract only the nouns (for example, NN, NNP) from the sentences.
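As a rough idea of that last step, and assuming the tags are the Penn Treebank ones that pos_tag returns by default (NN, NNS, NNP, NNPS), filtering on the "NN" prefix might be enough. extract_nouns is a hypothetical helper name.

```python
def extract_nouns(tagged_sentence):
    """Keep the tokens whose Penn Treebank tag starts with "NN"."""
    return [token for token, tag in tagged_sentence if tag.startswith("NN")]


# Second sentence from the printed result above.
tagged = [('Warren', 'NNP'), ('joined', 'VBD'), ('other', 'JJ'),
          ('Democrats', 'NNPS'), ('in', 'IN'), ('support', 'NN'), ('.', '.')]
print(extract_nouns(tagged))  # ['Warren', 'Democrats', 'support']
```

If the tags were produced with pos_tag(tokens, tagset='universal') instead, the noun tag would be NOUN and the prefix check would need to change accordingly.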

Can anyone help me fix the Python code?

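1 answer:

Answer 0: (score: -1)

Please see the question that has already been answered here. You can do some tagging to filter out the nouns, as described in the post. SO Link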