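I am using the NLTK tagger with Python 2.7 to tag simple English text, in order to extract the frequency of each word and its named-entity category. The following program is used for this purpose: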
import re
from collections import Counter
from nltk.tag.stanford import NERTagger
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))
WORD = re.compile(r'\w+')

def main():
    text = "title Optimal Play against Best Defence: Complexity and Heuristics"
    print text
    # Split the text into word tokens and count each token
    words = WORD.findall(text)
    print words
    word_frqc = Counter(words)
    tagger = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz",
                       "stanford-ner.jar")
    terms = []
    # Tag every token with its named-entity class
    answer = tagger.tag(words)
    print answer
    for i, word_pos in enumerate(answer):
        word, pos = word_pos
        # Map the named-entity class to a numeric category id
        if pos == 'PERSON':
            cat_Id = 1
        elif pos == 'ORGANIZATION':
            cat_Id = 2
        elif pos == 'LOCATION':
            cat_Id = 3
        else:
            cat_Id = 4
        # Look up the frequency using the word as returned by the tagger
        frqc = word_frqc.get(word)
        terms.append((i, word, cat_Id, frqc))
    print terms

if __name__ == '__main__':
    main()
The output of the program is as follows:
text = "title Optimal Play against Best **Defence:** Complexity and
Heuristics"
[(u'title', u'O'), (u'Optimal', u'O'), (u'Play', u'O'), (u'against', u'O'),
(u'Best', u'O'), (u'Defense', u'O'), (u'Complexity', u'O'), (u'and', u'O'),
(u'Heuristics', u'O')]
[(0, u'title', 4, 1), (1, u'Optimal', 4, 1), (2, u'Play', 4, 1), (3,
u'against', 4, 1), (4, u'Best', 4, 1), (5, u'**Defense**', 4, None), (6,
u'Complexity', 4, 1), (7, u'and', 4, 1), (8, u'Heuristics', 4, 1)]
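There is a problem, and it is caused by the tagger.tag() method. The method changes the word 'Defence' in the original text to 'Defense'. As a result, the program cannot find the word 'Defense' in word_frqc, so the frequency of that word is set to None.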
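The None value can be reproduced without the tagger. The following is a minimal sketch using a shortened, made-up word list rather than the full text; it only illustrates how Counter.get() behaves when the key has a different spelling from the one that was counted.

```python
from collections import Counter

# Shortened word list for illustration; only "Defence" (British spelling) is counted.
word_frqc = Counter(["Best", "Defence", "Complexity"])

found = word_frqc.get("Defence")      # spelling present in the original text
missing = word_frqc.get("Defense")    # spelling returned by the tagger
zero = word_frqc["Defense"]           # indexing a Counter gives 0 for unknown keys

print(found)
print(missing)
print(zero)
```

Here found is 1, missing is None, and zero is 0, so the lookup itself works; it is simply given a key that was never counted.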
Is there a way (in Python) to keep the method from changing the words?
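One possible workaround, sketched here rather than confirmed as the library's intended usage, is to ignore the token text returned by the tagger and keep the original word at the same position. This assumes that tagger.tag() returns exactly one (token, tag) pair per input token, in the same order, which matches the output shown above but is not guaranteed in general. The helper build_terms and the mapping CATEGORY_IDS are hypothetical names introduced for this illustration, and the tagged list is a hard-coded stand-in for tagger.tag(words) so that the example runs without Stanford NER.

```python
import re
from collections import Counter

WORD = re.compile(r'\w+')

# Hypothetical mapping from NER labels to the category ids used in the program;
# any other label falls back to 4.
CATEGORY_IDS = {'PERSON': 1, 'ORGANIZATION': 2, 'LOCATION': 3}


def build_terms(words, tagged):
    """Pair each original token with the tag found at the same position."""
    # Assumption: the tagger keeps the number and order of tokens unchanged.
    if len(words) != len(tagged):
        raise ValueError("token counts differ; cannot align by position")
    word_frqc = Counter(words)
    terms = []
    for i, (original, (_, tag)) in enumerate(zip(words, tagged)):
        cat_id = CATEGORY_IDS.get(tag, 4)
        # Use the original spelling for both the term and the frequency lookup.
        terms.append((i, original, cat_id, word_frqc[original]))
    return terms


text = "title Optimal Play against Best Defence: Complexity and Heuristics"
words = WORD.findall(text)

# Stand-in for tagger.tag(words), copied from the output in the question.
tagged = [('title', 'O'), ('Optimal', 'O'), ('Play', 'O'), ('against', 'O'),
          ('Best', 'O'), ('Defense', 'O'), ('Complexity', 'O'), ('and', 'O'),
          ('Heuristics', 'O')]

terms = build_terms(words, tagged)
print(terms[5])
```

With this alignment, the entry at index 5 keeps the spelling 'Defence' and a frequency of 1. If the tagger ever splits or merges tokens, the length check raises an error instead of silently misaligning words and tags.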