Regular expression to find large numbers written in word form

Time: 2017-04-01 12:30:52

Tags: python regex

I am trying to extract numbers written in word form from a string. For example, the input string might look like this:

"What is 3 million 6 hundred 5 divided by 5 hundred?"

From this input, I want to extract the two numbers as groups:

["3 million 6 hundred 5", "5 hundred"]

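Note: other input strings may contain more than two numbers.

I believe a regular expression is the right way to solve this. Ideally, I could pass in a list of scales such as: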

["hundred", "thousand", "million", "billion", ...]

This is what I have so far:

scales = ["hundred", "thousand", "million", "billion"]
scale_pattern = '|'.join(scales)
regex = re.compile('\b(d+' + scale_pattern + 'd+)+\b')

I know my pattern is not quite right. The pseudocode for what I want is:

for any number of the following occurrences:
    find the pattern [int word_from_list optional_int]

2 Answers:

Answer 0: (score: 3)

  

"Ideally, I could pass in a list of scales"

You can pass them in a non-capturing or capturing group, like this.

Regex: (?:\d+\s(?:million|hundred|thousand|billion)*\s*)+

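The regex above is a simple one. It checks for a number (\d+) followed by whitespace (\s), then an optional scale word (using the * quantifier), then optional whitespace (\s*). The whole pattern is repeated one or more times (using the + quantifier).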
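As a rough sketch of how the scale list from the question could be fed into this pattern, the snippet below builds the alternation with `'|'.join` and wraps it in a non-capturing group. The helper name `find_numbers` is hypothetical and only for illustration; stripping the match is an added step, because the pattern can consume a trailing space.

```python
import re

# Scale words taken from the question; extend the list as needed.
scales = ["hundred", "thousand", "million", "billion"]

# Join the scale words into an alternation. re.escape guards against
# regex metacharacters in case a scale word ever contains one.
scale_pattern = "|".join(re.escape(s) for s in scales)

# A number, whitespace, an optional scale word, optional whitespace,
# with the whole unit repeated one or more times.
regex = re.compile(r"(?:\d+\s(?:" + scale_pattern + r")*\s*)+")


def find_numbers(text):
    """Return every matched number phrase, stripped of surrounding whitespace.

    Hypothetical helper, not part of the original answer.
    """
    return [m.group(0).strip() for m in regex.finditer(text)]


print(find_numbers("What is 3 million 6 hundred 5 divided by 5 hundred?"))
# ['3 million 6 hundred 5', '5 hundred']
```

Note that the pattern requires whitespace after every number, so a bare number at the very end of a string (with nothing after it) would not be matched; whether that matters depends on the inputs you expect.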

Regex101 Demo

Answer 1: (score: 0)

Well, here is a poor man's parser.

# you should expand these lists later...
units = ["hundred", "thousand", "million", "billion"]
operations = ['divided', 'multiplied']
delims = ['by', 'with']
discards = ['?', '!', '.']

sentence = 'What is 3 million 6 hundred 5 divided by 5 hundred?'

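# Strip punctuation that would otherwise stay attached to a token.
filtered_sentence = sentence
for t in discards:
    filtered_sentence = filtered_sentence.replace(t, '')

numbers = []  # completed number phrases
buffer = []   # tokens of the phrase currently being built
for t in filtered_sentence.split():
    if t.isnumeric() or t in units:
        buffer.append(t)
    elif t in operations or t in delims:
        # Only an operation or delimiter word closes the current phrase;
        # any other word is skipped without closing it.
        if buffer:
            numbers.append(' '.join(buffer))
            buffer = []

# Flush a phrase that runs to the end of the sentence.
if buffer:
    numbers.append(' '.join(buffer))

print(numbers)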
# ['3 million 6 hundred 5', '5 hundred']