NLTK does not tag punctuation marks as needed in a Python 2 program

Asked: 2017-07-25 06:05:05

Tags: python python-2.7

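I am trying to use the tahake paraphrase library, which requires the words in each sentence to be tagged in a specific syntax. I have partially done this with the following code: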

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import nltk
from nltk.tag import pos_tag

text = '''The wife of a former U.S. president Bill Clinton Hillary Clinton visited China last Monday. Hillary Clinton wanted to visit China last month But postponed her plans till Monday last week. Hillary Clinton paid a visit to the People Republic of China on Monday. Last week the Secretary of State Ms Clinton visited Chinese officials.'''

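# Split on whitespace that follows '.' or '?', but not after abbreviations such as "U.S." or "Mr."
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
text = []
for sentence in sentences:
    # Tokenize the sentence and tag each token with its part of speech
    posTagges = pos_tag(nltk.word_tokenize(sentence))
    # Join the (word, tag) pairs as "word/TAG" and collect one string per sentence
    text = text + [" ".join([k + '/' + v for k,v in posTagges])]
print text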

I get the following output:

   ['The/DT wife/NN of/IN a/DT former/JJ U.S./NNP president/NN Bill/NNP Clinton/NNP Hillary/NNP Clinton/NNP visited/VBD China/NNP last/JJ Monday/NNP ./.', 'Hillary/NNP Clinton/NNP wanted/VBD to/TO visit/VB China/NNP last/JJ month/NN But/CC postponed/VBD her/PRP$ plans/NNS till/VBP Monday/NNP last/JJ week/NN ./.', 'Hillary/JJ Clinton/NNP paid/VBD a/DT visit/NN to/TO the/DT People/NNP Republic/NNP of/IN China/NNP on/IN Monday/NNP ./.', 'Last/JJ week/NN the/DT Secretary/NNP of/IN State/NNP Ms/NNP Clinton/NNP visited/VBD Chinese/JJ officials/NNS ./.']

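The problem I now face is tagging . and other punctuation marks. As the output shows, I get ./. whereas I need ./PUNCT

Any ideas would be appreciated.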

1 Answer:

Answer 0 (score: 1)

Use string.punctuation (after import string):

In [150]: string.punctuation
Out[150]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

[" ".join([k + '/PUNCT' if k in string.punctuation else k + '/' + v for k,v in posTagges])]
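
One caveat about the membership test above: `k in string.punctuation` is a substring check against a single string, so multi-character punctuation tokens such as `''` or the double backtick (which `nltk.word_tokenize` emits for double quotes) are not matched, while an empty string would be. Below is a minimal sketch of a stricter per-character check. The helper names `is_punct` and `retag_punct` are made up for illustration and are not part of NLTK, and the sample pairs are hand-written stand-ins for `pos_tag` output. Note that under this check symbols such as `$` also receive the `PUNCT` tag, which may or may not be what you want.

```python
import string

# Set of single punctuation characters, for per-character membership tests.
PUNCT_CHARS = set(string.punctuation)


def is_punct(token):
    """Return True if token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)


def retag_punct(tagged, punct_tag='PUNCT'):
    """Format (word, tag) pairs as 'word/TAG', replacing the tag of punctuation tokens."""
    return " ".join(
        word + '/' + (punct_tag if is_punct(word) else tag)
        for word, tag in tagged
    )


# Hand-written (word, tag) pairs standing in for pos_tag(...) output (an assumption,
# so the sketch runs without NLTK installed).
sample = [('Hillary', 'NNP'), ('Clinton', 'NNP'), ('visited', 'VBD'),
          ('China', 'NNP'), ('.', '.')]
result = retag_punct(sample)
print(result)
```

In the question's loop, this would replace the list comprehension with something like `retag_punct(posTagges)`. Tokens that merely contain punctuation, such as `U.S.`, keep their original tag because not every character is a punctuation mark.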