Python regex with alternative digits matches more than expected

Time: 2013-12-09 05:36:36

Tags: python regex

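I have a list of files, and I am trying to filter out the subset of file names that end with 000000, 060000, 120000 or 180000. I know I could do direct string matching, but I want to understand why the regex I tried below, r'[00|06|12|18]+0000', does not work (it also returns MSM_20130519210000.csv). I want it to match one of 00, 06, 12 or 18, followed by 0000. How can I achieve this? Please keep answers to this intended regex rather than other functions. Thanks.

Here is the code snippet: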

import re

files_in_input_directory = ['MSM_20130519150000.csv', 'MSM_20130519180000.csv', 'MSM_20130519210000.csv', 
'MSM_20130520000000.csv', 'MSM_20130520030000.csv', 'MSM_20130520060000.csv', 'MSM_20130520090000.csv', 
'MSM_20130520120000.csv', 'MSM_20130520150000.csv', 'MSM_20130520180000.csv', 'MSM_20130520210000.csv', 
'MSM_20130521000000.csv', 'MSM_20130521030000.csv', 'MSM_20130521060000.csv', 'MSM_20130521090000.csv', 
'MSM_20130521120000.csv', 'MSM_20130521150000.csv', 'MSM_20130521180000.csv', 'MSM_20130521210000.csv', 
'MSM_20130522000000.csv', 'MSM_20130522030000.csv', 'MSM_20130522060000.csv', 'MSM_20130522090000.csv', 
'MSM_20130522120000.csv', 'MSM_20130522150000.csv', 'MSM_20130522180000.csv', 'MSM_20130522210000.csv', 
'MSM_20130523000000.csv', 'MSM_20130523030000.csv', 'MSM_20130523060000.csv', 'MSM_20130523090000.csv', 
'MSM_20130523120000.csv', 'MSM_20130523150000.csv', 'MSM_20130523180000.csv', 'MSM_20130523210000.csv', 
'MSM_20130524000000.csv', 'MSM_20130524030000.csv', 'MSM_20130524060000.csv', 'MSM_20130524090000.csv', 
'MSM_20130524120000.csv', 'MSM_20130524150000.csv', 'MSM_20130524180000.csv', 'MSM_20130524210000.csv', 
'MSM_20130525000000.csv', 'MSM_20130525030000.csv', 'MSM_20130525060000.csv', 'MSM_20130525090000.csv', 
'MSM_20130525120000.csv', 'MSM_20130525150000.csv', 'MSM_20130525180000.csv', 'MSM_20130525210000.csv', 
'MSM_20130526000000.csv', 'MSM_20130526030000.csv', 'MSM_20130526060000.csv', 'MSM_20130526090000.csv', 
'MSM_20130526120000.csv', 'MSM_20130526150000.csv', 'MSM_20130526180000.csv', 'MSM_20130526210000.csv', 
'MSM_20130527000000.csv', 'MSM_20130527030000.csv', 'MSM_20130527060000.csv', 'MSM_20130527090000.csv', 
'MSM_20130527120000.csv', 'MSM_20130527150000.csv', 'MSM_20130527180000.csv', 'MSM_20130527210000.csv', 
'MSM_20130528000000.csv', 'MSM_20130528030000.csv', 'MSM_20130528060000.csv', 'MSM_20130528090000.csv', 
'MSM_20130528120000.csv', 'MSM_20130528150000.csv', 'MSM_20130528180000.csv', 'MSM_20130528210000.csv', 
'MSM_20130529000000.csv', 'MSM_20130529030000.csv', 'MSM_20130529060000.csv', 'MSM_20130529090000.csv']

print(files_in_input_directory)
print("\n")

# Trying to match any name containing 000000, 060000, 120000 or 180000.
# Question: I use + to mean "one or more" and | to separate the options, but this
# also matches 'MSM_20130519210000.csv', and I don't know why.
print(list(filter(lambda x: re.search(r'[00|06|12|18]+0000', x), files_in_input_directory)))
print("\n")

# This verbose version works
print(list(filter(lambda x: re.search(r'000000|060000|120000|180000', x), files_in_input_directory)))
print("\n")

3 Answers:

Answer 0 (score: 1)

If you are trying to match file names that contain 000000, 060000, 120000 or 180000, then instead of

re.search(r'[00|06|12|18]+0000', x)

use

re.search(r'(00|06|12|18)0000', x)

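Square brackets [...] match only one character at a time, and the + character means "match one or more of the preceding expression". Parentheses (...) group the alternatives instead, so 00|06|12|18 is read as a choice between four two-character strings.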
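To make the difference concrete, here is a minimal sketch that runs both patterns against four of the file names from the question. The names `sample`, `char_class` and `grouped` are only illustrative and do not appear in the original code.

```python
import re

# Four of the file names from the question.
sample = [
    'MSM_20130519180000.csv',
    'MSM_20130519210000.csv',
    'MSM_20130520000000.csv',
    'MSM_20130520030000.csv',
]

# Character class: one or more of the characters 0, 1, 2, 6, 8 or |, then 0000.
char_class = re.compile(r'[00|06|12|18]+0000')
# Group with alternation: exactly one of 00, 06, 12 or 18, then 0000.
grouped = re.compile(r'(00|06|12|18)0000')

class_hits = [name for name in sample if char_class.search(name)]
group_hits = [name for name in sample if grouped.search(name)]

print(class_hits)  # includes the unwanted 21 o'clock file
print(group_hits)  # only the 18 and 00 o'clock files
```

The character class accepts the 21 o'clock file because both 2 and 1 are members of the class, while the grouped pattern requires one of the four two-digit strings as a whole.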

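Answer 1 (score: 0)

[00|06|12|18] is a character set that matches any single character out of those in 00|06|12|18. So it will match 210000 in "MSM_20130519210000.csv", because [00|06|12|18] is equivalent to writing [01268|] (inside square brackets, | is a literal character, not alternation). Not what you meant, I should think.

Instead of expressing a character set that can match one or more times, make it a capturing group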

r'(00|06|12|18)0000'

or a positive lookbehind assertion

r'(?<=00|06|12|18)0000'

They are equivalent for your purposes, since you don't care about the matched text or any groups.
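The following sketch checks these claims under Python's `re` module. The variable names are illustrative only. Note that Python requires a lookbehind to have a fixed width, which holds here because every alternative is two characters long.

```python
import re

unwanted = 'MSM_20130519210000.csv'
wanted = 'MSM_20130520060000.csv'

# The original class and its spelled-out equivalent find the same text.
m1 = re.search(r'[00|06|12|18]+0000', unwanted)
m2 = re.search(r'[01268|]+0000', unwanted)
print(m1.group(), m2.group())

# Capturing group: the match includes the two leading digits.
group_match = re.search(r'(00|06|12|18)0000', wanted)
# Lookbehind: the leading digits are checked but not part of the match.
look_match = re.search(r'(?<=00|06|12|18)0000', wanted)
print(group_match.group(), group_match.group(1), look_match.group())

# Both corrected patterns reject the 21 o'clock file.
group_rejects = re.search(r'(00|06|12|18)0000', unwanted) is None
look_rejects = re.search(r'(?<=00|06|12|18)0000', unwanted) is None
print(group_rejects, look_rejects)
```

The two corrected patterns differ only in what the match object contains, which does not matter when the result is used purely as a true/false filter.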

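Answer 2 (score: 0)

The basic problem here is that you are not grouping the patterns; by using [...] you are instead creating a character set to match against.

This regex works: ((00)|(06)|(12)|(18))0000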
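Since no captured groups are needed for filtering, a non-capturing group (?:...) does the same job with less bookkeeping. The sketch below also anchors the pattern to the end of the name, which goes beyond what the answers propose: it assumes the timestamp always sits immediately before ".csv". The file name with hour 20 is hypothetical and was chosen only to show why the anchor can matter.

```python
import re

# Unanchored pattern from the answers, written with a non-capturing group.
unanchored = re.compile(r'(?:00|06|12|18)0000')
# Assumption: the timestamp is directly followed by ".csv" at the end of the name.
anchored = re.compile(r'(?:00|06|12|18)0000\.csv$')

names = [
    'MSM_20130520060000.csv',  # from the question, hour 06
    'MSM_20130519210000.csv',  # from the question, hour 21
    'MSM_20130521200000.csv',  # hypothetical: day 21, hour 20 contains "120000"
]

unanchored_hits = [n for n in names if unanchored.search(n)]
anchored_hits = [n for n in names if anchored.search(n)]

print(unanchored_hits)
print(anchored_hits)
```

With the three-hourly files listed in the question the two patterns select the same names; the anchor only changes the result for names like the hypothetical one, where the day and hour digits happen to spell one of the alternatives.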