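I am developing a chatbot and want to embed some rules in it. One of them is parsing questions like these:
"How much is twelve thousand three hundred four plus two hundred fifty six?" or "What is five hundred eighty nine divided by 89?"
I have the following code: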
import re
pat_num = re.compile(r'((\b(zero|one|two|three|four|five|'
r'six|seven|eight|nine|ten|eleven|'
r'twelve|thirteen|fourteen|fifteen|sixteen|'
r'seventeen|eighteen|nineteen|twenty|thirty|'
r'forty|fifty|sixty|seventy|eighty|'
r'ninety|hundred|thousand|million|billion|'
r'trillion)\b)+|\d+)')
ind_list = [(m.start(0), m.end(0)) for m in re.finditer(pat_num, sentence)]
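I would like each of the two sentences to return two numbers. For example, for the first sentence it should return the indexes of the numbers: twelve thousand three hundred four and two hundred fifty six.
However, for the first sentence it returns 9 numbers/matches, namely: twelve, thousand, three, hundred, four, two, hundred, fifty, six.
How can I change the regex so that it returns 2 numbers?
Thanks a lot for any help!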
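For reference, a minimal reproduction of the behaviour described above. The test sentence is an assumption (the snippet in the question does not define `sentence`); it is the first example sentence quoted at the top. The likely cause is that the `+` repeats `\b(word)\b` with nothing allowed between repetitions, so the whitespace between words ends every match after a single word.

```python
import re

# The pattern from the question, unchanged.
pat_num = re.compile(r'((\b(zero|one|two|three|four|five|'
                     r'six|seven|eight|nine|ten|eleven|'
                     r'twelve|thirteen|fourteen|fifteen|sixteen|'
                     r'seventeen|eighteen|nineteen|twenty|thirty|'
                     r'forty|fifty|sixty|seventy|eighty|'
                     r'ninety|hundred|thousand|million|billion|'
                     r'trillion)\b)+|\d+)')

# Assumed input: the first example sentence from the question.
sentence = "How much is twelve thousand three hundred four plus two hundred fifty six?"

# Collect the matched text to see how the sentence gets split up.
matches = [m.group(0) for m in pat_num.finditer(sentence)]
print(len(matches))  # 9
print(matches)
```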
Answer 0 (score: 2)
If you want the actual indexes rather than the matched text itself, a little lookahead should do it:
import re

# easier to manage as a list
numerals = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
            "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
            "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety", "hundred", "thousand", "million",
            "billion", "trillion"]
# one or more numerals separated by optional whitespace, followed by whitespace or end of string
pattern = re.compile(r"((({})\s*)+)(?=\s|$)|\d+".format("|".join(numerals)))  # all together
You can then test it as:
sentence = "How much is twelve thousand three hundred four plus two hundred fifty six?"
print([(m.start(0), m.end(0)) for m in re.finditer(pattern, sentence)])
# [(12, 46), (52, 69)]
sentence = "What is five hundred eighty nine divided by 89?"
print([(m.start(0), m.end(0)) for m in re.finditer(pattern, sentence)])
# [(8, 32), (44, 46)]
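Note that in the first output above the second span ends at 69, that is, before "six": the word is followed by "?", while the lookahead `(?=\s|$)` requires whitespace or the end of the string. The pattern also has no leading word boundary, so a numeral at the end of a longer word (such as the "one" in "someone") can match. A possible variant, sketched here under the assumption that word boundaries are acceptable for the chatbot's input, replaces the lookahead with `\b`. The names `pattern_wb` and `number_spans` are hypothetical and not part of the answer.

```python
import re

numerals = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
            "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
            "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety", "hundred", "thousand", "million",
            "billion", "trillion"]
alternation = "|".join(numerals)

# Hypothetical variant: word boundaries on both sides instead of the whitespace lookahead.
pattern_wb = re.compile(r"\b(?:{0})(?:\s+(?:{0}))*\b|\d+".format(alternation))


def number_spans(sentence):
    """Return the (start, end) index pair of every number found in the sentence."""
    return [m.span() for m in pattern_wb.finditer(sentence)]


print(number_spans("How much is twelve thousand three hundred four plus two hundred fifty six?"))
# [(12, 46), (52, 73)]
print(number_spans("What is five hundred eighty nine divided by 89?"))
# [(8, 32), (44, 46)]
```

With this variant the trailing "six" is included, and a word such as "someone" no longer produces a match.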
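Answer 1 (score: 1)
Build the number pattern from the <numeral>(?:[\s-]<numeral>)* block, which matches a numeral and then any 0+ sequences of a whitespace or - followed by a numeral.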
import re
numeral_rx = r'(?:zero|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|hundred|thousand|million|billion|trillion)'
sentences=["How much is twelve thousand three hundred four plus two hundred fifty six?",
"How much is twelve thousand three hundred and four divided by two hundred fifty-six?"]
pat_num = re.compile(r'\b{0}(?:(?:\s+(?:and\s+)?|-){0})*\b|\d+'.format(numeral_rx))
for sentence in sentences:
    print(re.findall(pat_num, sentence))
# => ['twelve thousand three hundred four', 'two hundred fifty six']
#    ['twelve thousand three hundred and four', 'two hundred fifty-six']
See the Python demo.
Note that, thanks to the non-capturing groups (?:...), a simple re.findall call is enough to get all the matches.
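The point about non-capturing groups can be illustrated with a small contrast. This is only a sketch: the shortened numeral list and the names `capturing` and `non_capturing` are assumptions made to keep the example short, and the sentence is made up.

```python
import re

sentence = "two hundred fifty-six plus 89"

# With capturing groups, re.findall returns one tuple of group contents per match.
capturing = re.compile(
    r'\b(two|hundred|fifty|six)((\s+|-)(two|hundred|fifty|six))*\b|\d+')

# With non-capturing groups, re.findall returns the whole matched text.
non_capturing = re.compile(
    r'\b(?:two|hundred|fifty|six)(?:(?:\s+|-)(?:two|hundred|fifty|six))*\b|\d+')

print(re.findall(capturing, sentence))      # tuples of groups, not the full matches
print(re.findall(non_capturing, sentence))  # ['two hundred fifty-six', '89']
```

If capturing groups are needed for other reasons, re.finditer with m.group(0) still gives the full match.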
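Details:
\b - a word boundary
{0} - the placeholder for the numeral alternation group
(?:(?:\s+(?:and\s+)?|-){0})* - 0 or more sequences of:
    (?:\s+(?:and\s+)?|-) - either of the two alternatives:
        \s+(?:and\s+)? - 1+ whitespaces followed by 1 or 0 "and" substrings and 1+ whitespaces
        | - or
        - - a hyphen
    {0} - the placeholder for the numeral alternation group
\b - a word boundary
| - or
\d+ - 1+ digits.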
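The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is a sketch of the same pattern, not a different technique; the names `NUMERALS` and `pat_num_verbose` are hypothetical.

```python
import re

NUMERALS = (
    "zero|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|"
    "thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|"
    "thirty|forty|fifty|sixty|seventy|eighty|ninety|hundred|thousand|"
    "million|billion|trillion"
)

pat_num_verbose = re.compile(
    r"""
    \b                      # word boundary
    (?:{0})                 # a numeral word
    (?:                     # 0 or more sequences of:
        (?:
            \s+(?:and\s+)?  #   1+ whitespaces, then an optional "and" and 1+ whitespaces
          |                 #   or
            -               #   a hyphen
        )
        (?:{0})             #   another numeral word
    )*
    \b                      # word boundary
    |                       # or
    \d+                     # 1+ digits
    """.format(NUMERALS),
    re.VERBOSE,
)

sentences = ["How much is twelve thousand three hundred four plus two hundred fifty six?",
             "How much is twelve thousand three hundred and four divided by two hundred fifty-six?"]
results = [pat_num_verbose.findall(s) for s in sentences]
for found in results:
    print(found)
# ['twelve thousand three hundred four', 'two hundred fifty six']
# ['twelve thousand three hundred and four', 'two hundred fifty-six']
```

One caveat of this layout: because str.format is applied to the whole pattern, the comments inside it must not contain curly braces.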