Regex approach to capturing one-word and two-word proper nouns

Time: 2014-03-18 22:17:46

Tags: python regex

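I came up with the following. I have narrowed the problem down to not being able to capture both one-word and two-word proper nouns.

(1) It would be great if I could add a condition that defaults to the longer match when choosing between two captures.

(2) It would also help if I could tell the regex to consider a string only when it starts with a preposition, such as On|At|For. I have been playing with something like this, but it does not work: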

(^On|^at)([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,5})

How can I do 1 and 2?

My current regex

r'([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,15})'

I want to capture Ashoka, Shift Series, Compass Partners, and Kenneth Cole

#'On its 25th anniversary, Ashoka',

#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',

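3 Answers:

Answer 0: (score: 1)

What you are trying to do here is called "named entity recognition" in natural language processing. If you really want an approach that finds proper nouns, you may have to consider stepping up to named entity recognition. Thankfully, the nltk library has some easy-to-use functions for this: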

import nltk
s2 = 'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole'
tokens2 = nltk.word_tokenize(s2)
tags = nltk.pos_tag(tokens2)
res = nltk.ne_chunk(tags)

Result:

res.productions()
Out[8]: 
[S -> ('at', 'IN') ('the', 'DT') ORGANIZATION ('national', 'JJ') ('conference', 'NN') (',', ',') ORGANIZATION ('and', 'CC') ('fashion', 'NN') ('designer', 'NN') PERSON,
 ORGANIZATION -> ('Shift', 'NNP') ('Series', 'NNP'),
 ORGANIZATION -> ('Compass', 'NNP') ('Partners', 'NNPS'),
 PERSON -> ('Kenneth', 'NNP') ('Cole', 'NNP')]
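
If the full chunk tree is more than you need, the part-of-speech tags in the result above already carry most of the signal: consecutive tokens tagged NNP or NNPS line up with the proper-noun phrases. Here is a minimal sketch, assuming the (word, tag) pairs come from nltk.pos_tag as shown. The helper name group_proper_nouns is hypothetical, and the tagged list is copied by hand from the output above rather than recomputed, so the snippet runs without nltk:

```python
from itertools import groupby

# Penn Treebank tags for singular and plural proper nouns.
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def group_proper_nouns(tagged):
    """Join each run of consecutive proper-noun tokens into one phrase."""
    phrases = []
    runs = groupby(tagged, key=lambda pair: pair[1] in PROPER_NOUN_TAGS)
    for is_proper, run in runs:
        if is_proper:
            phrases.append(" ".join(word for word, _tag in run))
    return phrases


# (word, tag) pairs copied from the productions shown above.
tagged = [
    ("at", "IN"), ("the", "DT"), ("Shift", "NNP"), ("Series", "NNP"),
    ("national", "JJ"), ("conference", "NN"), (",", ","),
    ("Compass", "NNP"), ("Partners", "NNPS"), ("and", "CC"),
    ("fashion", "NN"), ("designer", "NN"), ("Kenneth", "NNP"), ("Cole", "NNP"),
]

print(group_proper_nouns(tagged))
# ['Shift Series', 'Compass Partners', 'Kenneth Cole']
```

Unlike ne_chunk, this grouping does not label phrases as ORGANIZATION or PERSON, and it will happily merge three or more adjacent proper nouns into one phrase.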

Answer 1: (score: 1)

Not entirely correct, but this will match most of what you are looking for; the exception is On, which it also matches.

import re
text = """
#'On its 25th anniversary, Ashoka',

#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',
"""
proper_noun_regex = r'([A-Z]{1}[a-z]{1,}(\s[A-Z]{1}[a-z]{1,})?)'
p = re.compile(proper_noun_regex)
matches = p.findall(text)

print(matches)

Output:

[('On', ''), ('Ashoka', ''), ('Shift Series', ' Series'), ('Compass Partners', ' Partners'), ('Kenneth Cole', ' Cole')]
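
The optional second-word group is what gives the "prefer the longer match" behaviour asked for in (1): the `?` quantifier is greedy, so the two-word form wins whenever a second capitalized word follows, and the one-word form is used otherwise. The tuples in the output come from the capturing parentheses. A variant with a non-capturing group `(?:...)` returns plain strings instead. This is a sketch, not a replacement for the pattern above, and find_proper_nouns is a hypothetical helper name:

```python
import re

# One capitalized word, optionally followed by a second one.
# (?:...) groups without capturing, so findall returns whole matches.
PROPER_NOUN = re.compile(r'[A-Z][a-z]+(?:\s[A-Z][a-z]+)?')


def find_proper_nouns(text):
    """Return capitalized one- or two-word phrases, longest form first."""
    return PROPER_NOUN.findall(text)


print(find_proper_nouns('On its 25th anniversary, Ashoka'))
# ['On', 'Ashoka']
print(find_proper_nouns(
    'at the Shift Series national conference, '
    'Compass Partners and fashion designer Kenneth Cole'))
# ['Shift Series', 'Compass Partners', 'Kenneth Cole']
```

The false positive On is still there, so some filtering step is needed either way.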

Then maybe you could implement a filter to go through this list.

def filter_false_positive(unfiltered_matches):
    filtered_matches = []
    black_list = ["an","on","in","foo","bar"] #etc
    for match in unfiltered_matches:
        if match.lower() not in black_list:
            filtered_matches.append(match)
    return filtered_matches

Or, because python is cool:

def filter_false_positive(unfiltered_matches):
    black_list = ["an","on","in","foo","bar"] #etc
    return [match for match in unfiltered_matches if match.lower() not in black_list]

You can use it like this:

# CONTINUED FROM THE CODE ABOVE
matches = [i[0] for i in matches]
matches = filter_false_positive(matches)
print(matches)

Giving the final result:

['Ashoka', 'Shift Series', 'Compass Partners', 'Kenneth Cole']
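
For point (2) of the question, only considering strings that start with a preposition, one option is to keep that check separate from the proper-noun pattern instead of forcing both into a single regex. A minimal sketch under these assumptions: the preposition list (On, At, For) is taken from the question and matched case-insensitively, and proper_nouns_after_preposition is a hypothetical helper name:

```python
import re

# Leading preposition; \b stops "On" from matching inside "Online".
LEADING_PREPOSITION = re.compile(r'(?:On|At|For)\b', re.IGNORECASE)
# One capitalized word, optionally followed by a second one.
PROPER_NOUN = re.compile(r'[A-Z][a-z]+(?:\s[A-Z][a-z]+)?')


def proper_nouns_after_preposition(sentence):
    """Return proper-noun candidates only if the sentence starts with a preposition."""
    lead = LEADING_PREPOSITION.match(sentence)
    if lead is None:
        return []
    # Start searching after the preposition so it is not reported itself.
    return PROPER_NOUN.findall(sentence, lead.end())


print(proper_nouns_after_preposition('On its 25th anniversary, Ashoka'))
# ['Ashoka']
print(proper_nouns_after_preposition('Kenneth Cole is a brand name.'))
# []
```

Because the search starts after the preposition, the leading On no longer needs to be blacklisted for these inputs.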

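The problem of determining whether a word is capitalized because it appears at the start of a sentence, or because it is a proper noun, is not that trivial.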

'Kenneth Cole is a brand name.' v.s. 'Can I eat something now?' v.s. 'An English man had tea'

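In cases like these it is pretty difficult, so without something else that can identify a proper noun by other criteria (a blacklist, a database, etc.) it will not be that easy. Regex is great, but I don't think it can interpret English at a grammatical level in any trivial way...

That being said, good luck!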

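Answer 2: (score: 1)

I would use an NLP tool; the most popular one for python seems to be nltk. Regular expressions are really not the right way to go... There is an example right on the front page of the nltk site, which is linked earlier in this answer; here it is, copy-pasted: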

import nltk
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
print(tokens)
# ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
#  'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
tagged = nltk.pos_tag(tokens)
entities = nltk.chunk.ne_chunk(tagged)

entities now contains the words tagged according to the Penn treebank