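I'm using NLTK and would like to tokenize text with respect to collocations: for instance, "New York" should be a single token, whereas naive tokenization would split "New" and "York".
I know how to find collocations and how to tokenize, but I can't find how to combine the two...
Thanks.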
Answer 0 (score: 1)
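The approach that seems to fit your case is called named entity recognition. There are many links about named entity recognition with NLTK; I'll just give you one example, from here:
from nltk import sent_tokenize, word_tokenize, pos_tag, ne_chunk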
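def extract_entities(text):
    entities = []
    for sentence in sent_tokenize(text):
        chunks = ne_chunk(pos_tag(word_tokenize(sentence)))
        # Entity chunks are Tree objects; ordinary tokens are (word, tag) tuples.
        # NLTK 3 exposes the entity type via label() (older releases used .node).
        entities.extend([chunk for chunk in chunks if hasattr(chunk, 'label')])
    return entities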
if __name__ == '__main__':
    text = """
A multi-agency manhunt is under way across several states and Mexico after
police say the former Los Angeles police officer suspected in the murders of a
college basketball coach and her fiancé last weekend is following through on
his vow to kill police officers after he opened fire Wednesday night on three
police officers, killing one.
"In this case, we're his target," Sgt. Rudy Lopez from the Corona Police
Department said at a press conference.
The suspect has been identified as Christopher Jordan Dorner, 33, and he is
considered extremely dangerous and armed with multiple weapons, authorities
say. The killings appear to be retribution for his 2009 termination from the
Los Angeles Police Department for making false statements, authorities say.
Dorner posted an online manifesto that warned, "I will bring unconventional
and asymmetrical warfare to those in LAPD uniform whether on or off duty."
"""
    print(extract_entities(text))
Output:
[Tree('GPE', [('Mexico', 'NNP')]), Tree('GPE', [('Los', 'NNP'), ('Angeles', 'NNP')]), Tree('PERSON', [('Rudy', 'NNP')]), Tree('ORGANIZATION', [('Lopez', 'NNP')]), Tree('ORGANIZATION', [('Corona', 'NNP')]), Tree('PERSON', [('Christopher', 'NNP'), ('Jordan', 'NNP'), ('Dorner', 'NNP')]), Tree('GPE', [('Los', 'NNP'), ('Angeles', 'NNP')]), Tree('PERSON', [('Dorner', 'NNP')]), Tree('GPE', [('LAPD', 'NNP')])]
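To get from these entity trees back to a token list in which each entity is a single token, one option is to walk the chunked sentence and join the words inside each entity. The following is only a minimal sketch, assuming NLTK 3 (where Tree can be imported from nltk). The helper name merge_entities is hypothetical, and the hand-built tree merely stands in for real ne_chunk output so that no tagger models are needed to run it.

```python
from nltk import Tree


def merge_entities(chunks, sep=" "):
    """Flatten a chunked sentence into tokens, one token per named entity.

    `chunks` is assumed to look like ne_chunk output: an iterable whose items
    are either (word, tag) tuples or Tree objects wrapping such tuples.
    """
    tokens = []
    for chunk in chunks:
        if isinstance(chunk, Tree):
            # leaves() returns the (word, tag) pairs inside the entity
            tokens.append(sep.join(word for word, _tag in chunk.leaves()))
        else:
            word, _tag = chunk
            tokens.append(word)
    return tokens


# Hand-built stand-in for ne_chunk(pos_tag(word_tokenize(sentence)))
sentence = Tree('S', [
    ('the', 'DT'), ('former', 'JJ'),
    Tree('GPE', [('Los', 'NNP'), ('Angeles', 'NNP')]),
    ('police', 'NN'), ('officer', 'NN'),
])

print(merge_entities(sentence))
# ['the', 'former', 'Los Angeles', 'police', 'officer']
```

Passing sep="_" would give "Los_Angeles" instead, which can be more convenient if later steps split on whitespace.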
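Another approach is to use different measures of information overlap between two random variables, such as Mutual Information, Pointwise Mutual Information, the t-test and others. There is a good introduction in Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze; Chapter 5, Collocations, is available for download. This link is an example of extracting collocations with NLTK.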
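To make the statistical route concrete, here is a library-free sketch of scoring adjacent word pairs by pointwise mutual information, PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))), and then merging the selected pairs into single tokens. The function names and the thresholds min_count and min_pmi are illustrative assumptions, not part of NLTK. In NLTK itself, BigramCollocationFinder with BigramAssocMeasures.pmi covers the scoring step and MWETokenizer covers the merging step.

```python
import math
from collections import Counter


def find_collocations(tokens, min_count=2, min_pmi=1.0):
    """Return the set of adjacent word pairs whose PMI is at least min_pmi."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    selected = set()
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI estimates for rare pairs are unreliable
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        if math.log2(p_pair / (p_w1 * p_w2)) >= min_pmi:
            selected.add((w1, w2))
    return selected


def merge_collocations(tokens, collocations, sep=" "):
    """Greedily join adjacent tokens that form a known collocation."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2  # skip the second word of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "she moved to new york because new york is where the jobs are".split()
collocations = find_collocations(tokens)
print(merge_collocations(tokens, collocations))
# ['she', 'moved', 'to', 'new york', 'because', 'new york', 'is', 'where', 'the', 'jobs', 'are']
```

On a real corpus the frequency filter matters a great deal, because PMI tends to favour pairs of rare words. The values used here are only placeholders chosen for the toy sentence.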