Question

I am developing a chatbot and want to embed certain rules in it. One of them is to parse questions like these:

"How much is twelve thousand three hundred four plus two hundred fifty six?" or "What is five hundred eighty nine divided by 89?"

I have the following code:

import re

pat_num = re.compile(r'((\b(zero|one|two|three|four|five|'
                     r'six|seven|eight|nine|ten|eleven|'
                     r'twelve|thirteen|fourteen|fifteen|sixteen|'
                     r'seventeen|eighteen|nineteen|twenty|thirty|'
                     r'forty|fifty|sixty|seventy|eighty|'
                     r'ninety|hundred|thousand|million|billion|'
                     r'trillion)\b)+|\d+)')
sentence = "How much is twelve thousand three hundred four plus two hundred fifty six?"
ind_list = [(m.start(0), m.end(0)) for m in re.finditer(pat_num, sentence)]

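I expect two numbers to be returned for each sentence. For the first sentence, for example, it should return the indices of the numbers "twelve thousand three hundred four" and "two hundred fifty six".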

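However, for the first sentence it returns 9 numbers/matches: twelve, thousand, three, hundred, four, two, hundred, fifty, six.

How can I change the regex so that it returns 2 numbers?

Thanks a lot for your help!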

Answer 1

If you want the actual indices rather than the matched text itself, a little bit of lookahead should do it:

import re

# easier to manage as a list
numerals = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
            "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
            "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety", "hundred", "thousand", "million",
            "billion", "trillion"]

# The leading \b keeps a match from starting in the middle of a word. The lookahead
# (?=\W|$) lets a number end before whitespace, punctuation, or the end of the string,
# so a trailing "?" does not cut off the last word.
pattern = re.compile(r"\b((({})\s*)+)(?=\W|$)|\d+".format("|".join(numerals)))  # all together

You can then test it with:

sentence = "How much is twelve thousand three hundred four plus two hundred fifty six?"
print([(m.start(0), m.end(0)) for m in re.finditer(pattern, sentence)])
# [(12, 46), (52, 73)]

sentence = "What is five hundred eighty nine divided by 89?"
print([(m.start(0), m.end(0)) for m in re.finditer(pattern, sentence)])
# [(8, 32), (44, 46)]
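
To check what a pair of indices actually covers, the spans can be sliced back out of the sentence. The following is only a sketch of the lookahead idea: the helper name `number_spans` is hypothetical, and the pattern is a self-contained variant with non-capturing groups rather than necessarily the exact one used above.

```python
import re

# Full numeral vocabulary, written as one string and split for brevity.
numerals = ("zero one two three four five six seven eight nine ten eleven twelve "
            "thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty "
            "thirty forty fifty sixty seventy eighty ninety hundred thousand "
            "million billion trillion").split()

# A run of numeral words, each optionally followed by whitespace. The lookahead
# makes the engine give back the trailing space, so the match ends on a word.
pattern = re.compile(r"\b(?:(?:{})\s*)+(?=\W|$)|\d+".format("|".join(numerals)))


def number_spans(sentence):
    """Return (start, end, text) for every number found in `sentence`."""
    return [(m.start(), m.end(), sentence[m.start():m.end()])
            for m in pattern.finditer(sentence)]


sentence = "How much is twelve thousand three hundred four plus two hundred fifty six?"
for start, end, text in number_spans(sentence):
    print(start, end, repr(text))
```

Printing the sliced text next to the indices makes it easy to spot a span that stops one word short or swallows a trailing space.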

Answer 2

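Build the number pattern from the block <numeral>(?:[\s-]<numeral>)*, which matches a numeral followed by zero or more sequences of a whitespace character or a hyphen and then another numeral.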

import re
numeral_rx = r'(?:zero|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|hundred|thousand|million|billion|trillion)'
sentences=["How much is twelve thousand three hundred four plus two hundred fifty six?",
"How much is twelve thousand three hundred and four divided by two hundred fifty-six?"]
pat_num = re.compile(r'\b{0}(?:(?:\s+(?:and\s+)?|-){0})*\b|\d+'.format(numeral_rx))
for sentence in sentences:
    print(re.findall(pat_num, sentence))
# => ['twelve thousand three hundred four', 'two hundred fifty six']
#    ['twelve thousand three hundred and four', 'two hundred fifty-six']

See the Python demo.

Note that, because only non-capturing groups (?:...) are used, a plain re.findall call is enough to get all the matches.
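
The difference is easy to see in a small experiment. This is an illustrative sketch with a deliberately tiny word list; the names `capturing` and `non_capturing` are made up for the comparison.

```python
import re

sentence = "two hundred fifty six"
word = r"(?:two|six|fifty|hundred)"

# With one capturing group, findall returns that group's content instead of the
# whole match, and a repeated group only keeps its last repetition.
capturing = re.compile(r"\b{0}(\s+{0})*\b".format(word))
print(capturing.findall(sentence))      # [' six']

# With non-capturing groups only, findall returns each whole match.
non_capturing = re.compile(r"\b{0}(?:\s+{0})*\b".format(word))
print(non_capturing.findall(sentence))  # ['two hundred fifty six']
```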

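Details:

\b - a word boundary
{0} - placeholder for the alternation group containing the numeral words
(?:(?:\s+(?:and\s+)?|-){0})* - zero or more sequences of:
- (?:\s+(?:and\s+)?|-) - either of two alternatives:
  - \s+(?:and\s+)? - 1+ whitespace characters, optionally followed by the substring "and" and 1+ whitespace characters
  - | - or
  - - - a hyphen
- {0} - placeholder for the alternation group containing the numeral words
\b - a word boundary
| - or
\d+ - 1+ digits.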
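
The breakdown above can also be written directly into the pattern with re.VERBOSE, and re.finditer can be used in place of re.findall when the indices are needed, as in the question. This is a sketch of that combination; the name `pat_verbose` is introduced here for illustration.

```python
import re

numeral_rx = (r'(?:zero|one|two|three|four|five|six|seven|eight|nine|ten|eleven|'
              r'twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|'
              r'nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|'
              r'hundred|thousand|million|billion|trillion)')

# In VERBOSE mode, whitespace and "#" comments inside the pattern are ignored.
pat_verbose = re.compile(r'''
    \b {0}                         # a numeral word starting at a word boundary
    (?:                            # then zero or more of:
        (?: \s+ (?:and\s+)? | - )  #   whitespace (optionally "and" + whitespace) or a hyphen
        {0}                        #   followed by another numeral word
    )*
    \b                             # ending at a word boundary
    | \d+                          # or a run of digits
'''.format(numeral_rx), re.VERBOSE)

sentence = ("How much is twelve thousand three hundred and four "
            "divided by two hundred fifty-six?")
spans = [m.span() for m in pat_verbose.finditer(sentence)]
print(spans)
print([sentence[start:end] for start, end in spans])
```

Each span is a (start, end) pair suitable for slicing, so the same pattern gives both the indices and the matched text.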
