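I am trying to extract numbers written in a mixed digit-and-word form from a string. For example, an input string might look like:
"What is 3 million 6 hundred 5 divided by 5 hundred?"
From this input, I want to work out how to get the two numbers as groups:
["3 million 6 hundred 5", "5 hundred"]
Note: other input strings may contain more than two numbers.
I believe a regular expression is the right approach here. Ideally, I could pass in a list of scales, such as:
["hundred", "thousand", "million", "billion", ...]
Here is what I have so far: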
import re

scales = ["hundred", "thousand", "million", "billion"]
scale_pattern = '|'.join(scales)
regex = re.compile('\b(d+' + scale_pattern + 'd+)+\b')
I know my pattern isn't quite right. The pseudocode for what I want is:
for any number of the following occurrences:
    find the pattern [int word_from_list optional_int]
Answer 0 (score: 3)
> Ideally, I could pass in a list of scales
You can pass them in a non-capturing or capturing group, like this.
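Regex: (?:\d+\s(?:million|hundred|thousand|billion)*\s*)+
The above is a simple regex that checks for a number (\d+) followed by whitespace (\s) and a scale word, which is optional (using the * quantifier), and then optional trailing whitespace. The whole pattern is repeated one or more times (using the + quantifier).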
Regex101 Demo
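To tie this back to the question's wish to pass in a list of scales, here is a minimal Python sketch that builds the same pattern from a list. The use of re.escape and the strip() on each match are my own additions, not part of the answer above; strip() is there because the pattern can swallow one trailing space.

```python
import re

scales = ["hundred", "thousand", "million", "billion"]

# Escape each scale word so it is matched literally, then join with "|".
scale_pattern = "|".join(re.escape(s) for s in scales)

# Same shape as the regex above: a number, whitespace, an optional scale word,
# optional trailing whitespace, with the whole unit repeated one or more times.
regex = re.compile(r"(?:\d+\s(?:" + scale_pattern + r")*\s*)+")

sentence = "What is 3 million 6 hundred 5 divided by 5 hundred?"

# The pattern has no capturing groups, so findall returns the full matches.
numbers = [m.strip() for m in regex.findall(sentence)]
print(numbers)
# ['3 million 6 hundred 5', '5 hundred']
```

Note that the pattern requires whitespace after every run of digits, so a bare number at the very end of a string (for example "divided by 5") would not be matched as written.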
Answer 1 (score: 0)
Well, below is a poor man's parser.
# You should expand these lists later...
units = ["hundred", "thousand", "million", "billion"]
operations = ['divided', 'multiplied']
delims = ['by', 'with']
discards = ['?', '!', '.']

sentence = 'What is 3 million 6 hundred 5 divided by 5 hundred?'

# Strip punctuation so that "hundred?" becomes "hundred".
filtered_sentence = sentence
for t in discards:
    filtered_sentence = filtered_sentence.replace(t, '')

filtered_t = []
buffer = ''
for t in filtered_sentence.split(' '):
    if t.isnumeric() or t in units:
        # Number or scale word: extend the current group.
        buffer += t + ' '
    elif t in operations or t in delims:
        # Operation or delimiter word: close the current group, if any.
        if buffer != '':
            filtered_t.append(buffer[:-1])  # drop the trailing space
            buffer = ''
# Flush the last group.
if buffer != '':
    filtered_t.append(buffer[:-1])

print(filtered_t)
# ['3 million 6 hundred 5', '5 hundred']
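The loop above only closes a group when it sees a word from operations or delims, so any other word between two numbers (say, "and") would merge them into one group. As a possible variation, not the answer author's method, the sketch below groups consecutive number-like tokens with itertools.groupby, so any other word ends a group. The function name extract_number_phrases is hypothetical.

```python
import re
from itertools import groupby


def extract_number_phrases(sentence, units):
    """Return runs of consecutive digit tokens and unit words as phrases."""
    # Tokenize into runs of word characters, which drops punctuation such as "?".
    tokens = re.findall(r"\w+", sentence)
    unit_set = set(units)
    phrases = []
    # groupby collects consecutive tokens that share the same key:
    # True for digit tokens or unit words, False for everything else.
    for is_number_part, group in groupby(
        tokens, key=lambda t: t.isdigit() or t in unit_set
    ):
        if is_number_part:
            phrases.append(" ".join(group))
    return phrases


units = ["hundred", "thousand", "million", "billion"]
result = extract_number_phrases(
    "What is 3 million 6 hundred 5 divided by 5 hundred?", units
)
print(result)
# ['3 million 6 hundred 5', '5 hundred']
```

One trade-off of this variant: a unit word standing on its own (for example "a hundred times") is also returned as a phrase, so you may want to require at least one digit token per group.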