Regex to capture a digit depending on the preceding text

Time: 2014-11-21 14:09:44

Tags: python regex regex-lookarounds

Consider the following text:

    one = "ambience: 5 comments:xxx food: 4 comments: xxxx service: 3 comments: xxx"

    two = "ambience: 5 comments:xxx food:   comments: since nothing to eat after 8 pm service: 4  comments: xxxx "

    three = "ambience: it is a 5 comments:xxx food: a 6   comments: since nothing to eat after 8 pm service: a 4  comments: xxxx "

For string one:

    re.findall(ur'(ambience|food|service)[\s\S]*?(\d)',one,re.UNICODE)
    [('ambience', '5'), ('food', '4'), ('service', '3')]

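For string two, the result is

    [('ambience', '5'), ('food', '8'), ('service', '4')]

Because this pattern simply looks for the first digit after the given keyword, the result is quite misleading when a rating has been skipped, intentionally or not: here the 8 from "after 8 pm" is reported as the food rating.

If a rating is missing, how can I get the regex to return NaN for it?

    [('ambience', '5'), ('food', 'NaN'), ('service', '4')]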

I also have a variant that uses lookahead and lookbehind anchors:

    re.findall(ur'(?<=food)[\s]*:[^\d]*([\d[.|-|\/|-]+)[^\d]*(?=comment[s]*[\s]*:)',one,re.UNICODE)

1 answer:

Answer 0: (score: 1)

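A simple change to the regex does the trick:

    (ambience|food|service):[^\d:]*(\d*)

  • [^\d:]* matches anything other than : or a digit, so the search for a rating cannot run past the colon of the next "comments:" label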
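
To see why the negated class helps, here is a minimal sketch on a shortened input (the `text` string is a made-up example, not one of the strings from the question). It prints the full match so you can see where `[^\d:]*` stops.

```python
import re

# Pattern from the answer: keyword, colon, a run of characters that are
# neither digits nor colons, then zero or more digits.
pattern = re.compile(r'(ambience|food|service):[^\d:]*(\d*)')

# Hypothetical shortened input in which the food rating is missing.
text = "food:   comments: nothing to eat after 8 pm service: 4"

match = pattern.search(text)

# [^\d:]* consumes "   comments" and halts at the colon that follows it,
# so \d* matches the empty string instead of skipping ahead to the 8.
print(repr(match.group(0)))  # 'food:   comments'
print(match.groups())        # ('food', '')
```

Because `\d*` may match nothing, a keyword with no rating still produces a match, just with an empty second group.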

Example match: http://regex101.com/r/bM0gT2/1

Usage example

    >>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', one)
    [('ambience', '5'), ('food', '4'), ('service', '3')]
    >>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', two)
    [('ambience', '5'), ('food', ''), ('service', '4')]
    >>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', three)
    [('ambience', '5'), ('food', '6'), ('service', '4')]
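
The pattern returns an empty string rather than NaN for a missing rating, so the question's desired output needs one small post-processing step. A possible sketch follows; the helper name `extract_ratings` is my own and not part of the original answer.

```python
import re

pattern = re.compile(r'(ambience|food|service):[^\d:]*(\d*)')

def extract_ratings(text):
    """Return (category, rating) pairs, using 'NaN' where no digit was captured."""
    # An empty capture is falsy, so `digits or 'NaN'` substitutes the placeholder.
    return [(name, digits or 'NaN') for name, digits in pattern.findall(text)]

two = ("ambience: 5 comments:xxx food:   comments: since nothing to eat\n"
       "after 8 pm service: 4  comments: xxxx ")

print(extract_ratings(two))
# [('ambience', '5'), ('food', 'NaN'), ('service', '4')]
```

If numeric values are wanted instead of strings, the same comprehension could presumably use `float(digits) if digits else float('nan')`.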