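I am trying to use the tahake paraphrasing library, and I need to tag the words in each sentence according to the grammar it expects. I have partially done this with the following code: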
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import nltk
from nltk.tag import pos_tag
text = '''The wife of a former U.S. president Bill Clinton Hillary Clinton visited China last Monday. Hillary Clinton wanted to visit China last month But postponed her plans till Monday last week. Hillary Clinton paid a visit to the People Republic of China on Monday. Last week the Secretary of State Ms Clinton visited Chinese officials.'''
# Split into sentences on whitespace that follows '.' or '?', skipping abbreviations
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
text = []
for sentence in sentences:
    # Tag each token, then join the pairs as word/TAG
    posTagges = pos_tag(nltk.word_tokenize(sentence))
    text = text + [" ".join([k + '/' + v for k,v in posTagges])]
print(text)
I get the following output:
['The/DT wife/NN of/IN a/DT former/JJ U.S./NNP president/NN Bill/NNP Clinton/NNP Hillary/NNP Clinton/NNP visited/VBD China/NNP last/JJ Monday/NNP ./.', 'Hillary/NNP Clinton/NNP wanted/VBD to/TO visit/VB China/NNP last/JJ month/NN But/CC postponed/VBD her/PRP$ plans/NNS till/VBP Monday/NNP last/JJ week/NN ./.', 'Hillary/JJ Clinton/NNP paid/VBD a/DT visit/NN to/TO the/DT People/NNP Republic/NNP of/IN China/NNP on/IN Monday/NNP ./.', 'Last/JJ week/NN the/DT Secretary/NNP of/IN State/NNP Ms/NNP Clinton/NNP visited/VBD Chinese/JJ officials/NNS ./.']
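The problem I am facing now is how the period and other punctuation marks are tagged. For a period I get ./. in the output, whereas I need ./PUNCT
Any ideas would be appreciated.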
Answer 0: (score: 1)
Use string.punctuation:
In [150]: string.punctuation
Out[150]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
# With "import string" at the top, use this join inside the loop instead:
[" ".join([k + '/PUNCT' if k in string.punctuation else k + '/' + v for k,v in posTagges])]
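One caveat with the one-liner: k in string.punctuation is a substring test, so multi-character punctuation tokens that word_tokenize can produce (for example `` or '') are not caught, and an empty string would be. Here is a minimal sketch of a stricter check. The helper names is_punct and format_tagged are made up for illustration, and the sample pairs are hand-written in the shape that pos_tag returns so the snippet runs without NLTK installed.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punct(token):
    """Return True if token is non-empty and consists only of ASCII punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

def format_tagged(tagged, punct_tag="PUNCT"):
    """Join (token, tag) pairs as 'token/TAG', overriding the tag for punctuation."""
    return " ".join(
        tok + "/" + (punct_tag if is_punct(tok) else tag)
        for tok, tag in tagged
    )

# Hand-written (token, tag) pairs standing in for pos_tag output (an assumption,
# not real tagger output).
sample = [("Ms", "NNP"), ("Clinton", "NNP"), ("visited", "VBD"),
          ("officials", "NNS"), ("''", "''"), (".", ".")]
print(format_tagged(sample))
# Ms/NNP Clinton/NNP visited/VBD officials/NNS ''/PUNCT ./PUNCT
```

In the question's loop this would replace the join with text.append(format_tagged(posTagges)). Tokens such as U.S. keep their original tag because they contain letters as well as periods.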