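How can I make Python's re treat consecutive matches of pattern alternatives as a single occurrence of the pattern?

Time: 2017-07-12 23:21:38

Tags: python regex

I am developing a chatbot and want to embed certain rules in it. One of them is to parse questions like these: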

"How much is twelve thousand three hundred four plus two hundred fifty six?" or "What is five hundred eighty nine divided by 89?"

I have the following code:

import re

pat_num = re.compile(r'((\b(zero|one|two|three|four|five|'
                     r'six|seven|eight|nine|ten|eleven|'
                     r'twelve|thirteen|fourteen|fifteen|sixteen|'
                     r'seventeen|eighteen|nineteen|twenty|thirty|'
                     r'forty|fifty|sixty|seventy|eighty|'
                     r'ninety|hundred|thousand|million|billion|'
                     r'trillion)\b)+|\d+)')
sentence = "How much is twelve thousand three hundred four plus two hundred fifty six?"
ind_list = [(m.start(0), m.end(0)) for m in re.finditer(pat_num, sentence)]

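I expect both sentences to return two numbers. For the first sentence, for example, it should return the indices of the numbers twelve thousand three hundred four and two hundred fifty six.

However, it returns 9 numbers/matches for the first sentence, namely: twelve, thousand, three, hundred, four, two, hundred, fifty, six.

How can I change the regex so that it returns 2 numbers?

Thanks a lot for your help!

2 answers:

Answer 0: (score: 2)

If you want the actual indices rather than the matched text itself, a little lookahead should do it: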

import re

# easier to manage as a list
numerals = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
            "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
            "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety", "hundred", "thousand", "million",
            "billion", "trillion"]

pattern = re.compile(r"((({})\s*)+)(?=\s|$)|\d+".format("|".join(numerals)))  # all together

You can then test it as:

sentence = "How much is twelve thousand three hundred four plus two hundred fifty six?"
print([(m.start(0), m.end(0)) for m in re.finditer(pattern, sentence)])
# [(12, 46), (52, 69)]

sentence = "What is five hundred eighty nine divided by 89?"
print([(m.start(0), m.end(0)) for m in re.finditer(pattern, sentence)])
# [(8, 32), (44, 46)]
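
One caveat about the output above: the second span for the first sentence stops at 69, which is the end of "fifty". The final "six" is followed directly by "?", so the (?=\s|$) lookahead rejects it. Below is a minimal sketch of a variant that uses word boundaries (\b) instead of the lookahead. This assumes that a number followed by punctuation should still be matched in full; it is not part of the original answer, and the names number_word, pattern_wb, and number_spans are made up for illustration.

```python
import re

numerals = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
            "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
            "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety", "hundred", "thousand", "million",
            "billion", "trillion"]

# Hypothetical helper pattern: a single numeral word, non-capturing.
number_word = r"(?:{})".format("|".join(numerals))

# A run of numeral words separated by whitespace and delimited by word
# boundaries (instead of a whitespace lookahead), or a run of digits.
pattern_wb = re.compile(r"\b{0}(?:\s+{0})*\b|\d+".format(number_word))


def number_spans(sentence):
    """Return (start, end) index pairs for each number found in sentence."""
    return [m.span() for m in pattern_wb.finditer(sentence)]


print(number_spans("How much is twelve thousand three hundred four plus two hundred fifty six?"))
# [(12, 46), (52, 73)]
print(number_spans("What is five hundred eighty nine divided by 89?"))
# [(8, 32), (44, 46)]
```

With the word boundary, the trailing "six" is included, so the second span now ends at 73 rather than 69.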

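Answer 1: (score: 1)

Build the number pattern from the block <numeral>(?:[\s-]<numeral>)*, which matches a numeral followed by zero or more sequences of a whitespace character or a hyphen followed by another numeral.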

import re
numeral_rx = r'(?:zero|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|hundred|thousand|million|billion|trillion)'
sentences=["How much is twelve thousand three hundred four plus two hundred fifty six?",
"How much is twelve thousand three hundred and four divided by two hundred fifty-six?"]
pat_num = re.compile(r'\b{0}(?:(?:\s+(?:and\s+)?|-){0})*\b|\d+'.format(numeral_rx))
for sentence in sentences:
    print(re.findall(pat_num, sentence))
# => ['twelve thousand three hundred four', 'two hundred fifty six']
#    ['twelve thousand three hundred and four', 'two hundred fifty-six']

See the Python demo.

Note that, because the groups are non-capturing, (?:...), a plain re.findall call is enough to get all the matches.
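
To illustrate why the group type matters, here is a small sketch with a deliberately shortened alternation (two|hundred stands in for the full numeral list, purely for brevity). When a pattern contains a capturing group, re.findall returns what the group captured rather than the whole match:

```python
import re

text = "two hundred and 7"

# With a capturing group, findall returns the contents of group 1.
# For the digit match, group 1 did not participate, so '' is returned.
with_capture = re.findall(r"(two|hundred)|\d+", text)
print(with_capture)      # ['two', 'hundred', '']

# With a non-capturing group, findall returns the full matched text.
without_capture = re.findall(r"(?:two|hundred)|\d+", text)
print(without_capture)   # ['two', 'hundred', '7']
```

With several capturing groups, as in the pattern from the question, findall would return a tuple per match instead of a string.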

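Details:

  • \b - a word boundary
  • {0} - placeholder for the alternation group that holds the numeral words
  • (?:(?:\s+(?:and\s+)?|-){0})* - 0 or more sequences of:
    • (?:\s+(?:and\s+)?|-) - either of two alternatives:
      • \s+(?:and\s+)? - 1+ whitespace characters, optionally followed by the substring and plus 1+ whitespace characters
      • | - or
      • - - a hyphen
    • {0} - placeholder for the alternation group that holds the numeral words
  • \b - a word boundary
  • | - or
  • \d+ - 1+ digits.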
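
The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of the same regex in a different layout; the names words, pat_verbose, and matches are illustrative. Since the question asked for indices, the example uses finditer and m.span() rather than findall.

```python
import re

# Numeral words, joined into one non-capturing alternation group.
words = ("zero one two three four five six seven eight nine ten eleven twelve "
         "thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty "
         "thirty forty fifty sixty seventy eighty ninety hundred thousand "
         "million billion trillion").split()
numeral_rx = r"(?:{})".format("|".join(words))

pat_verbose = re.compile(r"""
    \b {0}                          # word boundary, then one numeral word
    (?:                             # zero or more repetitions of:
        (?: \s+ (?:and\s+)? | - )   #   whitespace (optionally and + whitespace), or a hyphen
        {0}                         #   followed by another numeral word
    )*
    \b                              # closing word boundary
    | \d+                           # or: one or more digits
""".format(numeral_rx), re.VERBOSE)

sentence = "How much is twelve thousand three hundred and four divided by two hundred fifty-six?"
matches = [(m.span(), m.group()) for m in pat_verbose.finditer(sentence)]
for span, text in matches:
    print(span, repr(text))
# (12, 50) 'twelve thousand three hundred and four'
# (62, 83) 'two hundred fifty-six'
```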