Consider the following text:
one="ambience: 5 comments:xxx food: 4 comments: xxxx service: 3 comments: xxx"
two="ambience: 5 comments:xxx food: comments: since nothing to eat after 8 pm service: 4 comments: xxxx "
three="ambience: it is a 5 comments:xxx food: a 6 comments: since nothing to eat after 8 pm service: a 4 comments: xxxx "
For string one, the following call gives the expected result:
re.findall(ur'(ambience|food|service)[\s\S]*?(\d)',one,re.UNICODE)
[('ambience', '5'), ('food', '4'), ('service', '3')]
For string two, the result is:
[('ambience', '5'), ('food', '8'), ('service', '4')]
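Because this logic simply looks for the first digit after the given label, the result is quite misleading when a rating is skipped, whether intentionally or not: here the 8 from "8 pm" is reported as the food rating.
If a rating is missing, how can I get the regex to return NaN for it, like this?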
[('ambience', '5'), ('food', 'NaN'), ('service', '4')]
I also have a variant that uses lookahead and lookbehind anchors:
re.findall(ur'(?<=food)[\s]*:[^\d]*([\d[.|-|\/|-]+)[^\d]*(?=comment[s]*[\s]*:)',one,re.UNICODE)
Answer 0 (score: 1)
A small change to the regex does the trick:
(ambience|food|service):[^\d:]*(\d*)
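[^\d:]* matches any run of characters that are neither a digit nor a colon, so the search for a rating stops at the next colon (the one after comments) instead of running on to a later digit. When no digit appears before that colon, (\d*) captures an empty string.
See the example at http://regex101.com/r/bM0gT2/1
Example usage: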
>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', one)
[('ambience', '5'), ('food', '4'), ('service', '3')]
>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', two)
[('ambience', '5'), ('food', ''), ('service', '4')]
>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', three)
[('ambience', '5'), ('food', '6'), ('service', '4')]
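The regex above returns an empty string for a skipped rating, not the NaN the question asks for. A minimal sketch of one way to close that gap in Python 3 is to post-process the matches; the names RATING_RE and extract_ratings are hypothetical helpers introduced here for illustration, not part of the original answer.

```python
import re

# Same pattern as in the answer: the label, a colon, any non-digit/non-colon
# filler, then an optional run of digits.
RATING_RE = re.compile(r'(ambience|food|service):[^\d:]*(\d*)')


def extract_ratings(text):
    """Return (label, rating) pairs, substituting 'NaN' for an empty rating."""
    return [(label, rating or 'NaN') for label, rating in RATING_RE.findall(text)]


two = ("ambience: 5 comments:xxx food: comments: since nothing to eat "
       "after 8 pm service: 4 comments: xxxx ")

print(extract_ratings(two))
# [('ambience', '5'), ('food', 'NaN'), ('service', '4')]
```

If numeric values are needed rather than strings, the same comprehension could map an empty match to float('nan') and the others to int(rating).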