How can I use a regular expression in Python to extract the word after a list of keywords?

Asked: 2017-10-12 07:16:08

Tags: python regex

I am trying to extract locations using a regex in Python. Right now I am doing this:

def get_location(s):
    s = s.strip(STRIP_CHARS)
    keywords = "at|outside|near"
    location_pattern = "(?P<location>((?P<place>{keywords}\s[A-Za-z]+)))".format(keywords = keywords)
    location_regex = re.compile(location_pattern, re.IGNORECASE | re.MULTILINE | re.UNICODE | re.DOTALL | re.VERBOSE)

    for match in location_regex.finditer(s):
        match_str = match.group(0)
        indices = match.span(0)
        print ("Match", match)
        match_str = match.group(0)
        indices = match.span(0)
        print (match_str)

get_location("Im at building 3")

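I have three questions:

  1. The output is only "at", but it should also include "building".
  2. captures = match.capturesdict(): I cannot use it to extract the captures the way other examples do.
  3. When I use location_pattern = 'at|outside\s\w+', it seems to work. Can someone explain what I am doing wrong?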

1 Answer:

Answer 0: (score: 1)

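The main issue here is that you need to put {keywords} inside a non-capturing group: (?:{keywords}). Alternation has the lowest precedence in a regex, so each | splits the whole pattern. Here is a schematic example: a|b|c\s+\w+ matches either a, or b, or c + <whitespace(s)> + <word character(s)>. When you put the alternation list into a group, (a|b|c)\s+\w+, the pattern matches a, b, or c, and only then tries to match the whitespace followed by word characters.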
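To make the schematic example concrete, here is a minimal sketch using only the standard library re module. The sample string and the variable names are made up for illustration; they are not from the question.

```python
import re

# Hypothetical sample text, chosen so that the letters a, b, c
# appear only as standalone "keywords".
sample = "a one, b two, c three"

# Without a group the alternatives are: "a", "b", and "c\s+\w+".
ungrouped = re.compile(r"a|b|c\s+\w+")

# With a non-capturing group, \s+\w+ must follow whichever
# alternative matched.
grouped = re.compile(r"(?:a|b|c)\s+\w+")

ungrouped_matches = ungrouped.findall(sample)
grouped_matches = grouped.findall(sample)

print(ungrouped_matches)  # ['a', 'b', 'c three']
print(grouped_matches)    # ['a one', 'b two', 'c three']
```

Only the last alternative picks up the following word in the ungrouped pattern, which is the same effect the question shows with "at".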

See the updated code (demo online):

import regex as re  # third-party "regex" module; it provides capturesdict()

def get_location(s):
    STRIP_CHARS = '*'
    s = s.strip(STRIP_CHARS)
    keywords = "at|outside|near"
    # The non-capturing group makes \s+[A-Za-z]+ apply to every keyword.
    # A raw string keeps \s from being treated as a string escape.
    location_pattern = r"(?P<location>((?P<place>(?:{keywords})\s+[A-Za-z]+)))".format(keywords = keywords)
    location_regex = re.compile(location_pattern, re.IGNORECASE | re.UNICODE)

    for match in location_regex.finditer(s):
        print ("Match", match)
        match_str = match.group(0)   # full text of the match
        indices = match.span(0)      # (start, end) offsets of the match
        print (match_str)
        captures = match.capturesdict()  # every capture of each named group
        print(captures)

get_location("Im at building 3")

Output (from Python 2, hence the tuple on the first line):

('Match', <regex.Match object; span=(3, 14), match='at building'>)
at building
{'place': ['at building'], 'location': ['at building']}

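Note that location_pattern = 'at|outside\s\w+' does not work properly either: at matches anywhere on its own, while only outside must be followed by a whitespace character and word characters. You can fix it in the same way: (at|outside)\s\w+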
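The difference is easy to check with the standard library re module. This is a small sketch; the helper first_match and the second sample sentence are assumptions added for illustration.

```python
import re

broken = re.compile(r"at|outside\s\w+", re.IGNORECASE)
fixed = re.compile(r"(at|outside)\s\w+", re.IGNORECASE)

def first_match(regex, text):
    """Return the text of the first match, or None if there is none."""
    m = regex.search(text)
    return m.group(0) if m else None

at_broken = first_match(broken, "Im at building 3")
outside_broken = first_match(broken, "Im outside building 3")
at_fixed = first_match(fixed, "Im at building 3")
outside_fixed = first_match(fixed, "Im outside building 3")

print(at_broken)       # at
print(outside_broken)  # outside building
print(at_fixed)        # at building
print(outside_fixed)   # outside building
```

The ungrouped pattern only "seems to work" when the keyword happens to be the last alternative, because that is the only branch that includes \s\w+.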

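If you put the keywords into a group, captures = match.capturesdict() will work fine (see the output above). Note that capturesdict() is provided by the third-party regex module (imported above as re), not by the standard library re module.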
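If only the standard library is available, a rough equivalent can be sketched with groupdict(), which returns one string per named group rather than the list of captures that capturesdict() gives. The following is a variation on the code above and not the answer's exact approach: the names KEYWORDS and find_locations, the separate keyword group, the \b word boundary, and the sample sentence are all assumptions added for illustration.

```python
import re

KEYWORDS = ["at", "outside", "near"]  # hypothetical keyword list

def find_locations(text, keywords=KEYWORDS):
    # Escape each keyword so any regex metacharacters are taken literally.
    alternation = "|".join(re.escape(k) for k in keywords)
    # \b keeps "at" from matching inside a word such as "cat".
    pattern = re.compile(
        r"\b(?P<keyword>{alt})\s+(?P<place>[A-Za-z]+)".format(alt=alternation),
        re.IGNORECASE,
    )
    # groupdict() maps each named group to the text it captured.
    return [m.groupdict() for m in pattern.finditer(text)]

results = find_locations("Im at building 3 and the cat is outside gate")
print(results)
```

Building the alternation from a list keeps the grouping in one place, so adding a keyword cannot reintroduce the precedence problem.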