Question

mato_grosso_2000_test.csv
mato_grosso_do_sul_2000_test.csv

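I want to build a regular expression in Python that can tell the two lines above apart. _20xx_test.csv is always present in the file name, where xx ranges from 00 to 17. How can I do this?

I tried a simple fnmatch, but it cannot tell the two apart: the pattern for mato_grosso also matches mato_grosso_do_sul.

Edit:

I want the regex test to select mato_grosso_2000_test.csv and mato_grosso_2001_test.csv but not mato_grosso_do_sul_2000_test.csv.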

Answer 1

You can use a regular expression with a negative lookahead assertion to match "mato_grosso_" only when it is not followed by "do_sul". For example:

import re

re.match(r'mato_grosso_(?!do_sul)', 'mato_grosso_2000_test.csv')         # returns a match object
re.match(r'mato_grosso_(?!do_sul)', 'mato_grosso_do_sul_2000_test.csv')  # returns None

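This finds a match for the first example but not for the second.

The Python re module documentation discusses regular expression syntax in more detail. To learn more, look up "negative lookahead" there.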
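As a rough sketch of how the lookahead might be applied to a whole list of names: the `filenames` list and the `pattern` variable below are illustrative assumptions, and the trailing `.*_test\.csv` is one possible way to anchor the rest of the name.

```python
import re

# Hypothetical sample names, for illustration only.
filenames = [
    'mato_grosso_2000_test.csv',
    'mato_grosso_2001_test.csv',
    'mato_grosso_do_sul_2000_test.csv',
]

# (?!do_sul) rejects any name in which "do_sul" directly follows "mato_grosso_".
pattern = re.compile(r'mato_grosso_(?!do_sul).*_test\.csv')

# fullmatch requires the whole name to fit the pattern, not just a prefix.
selected = [name for name in filenames if pattern.fullmatch(name)]
print(selected)
```

If the year also has to stay within 2000-2017, the `.*` could be replaced by `20(0[0-9]|1[0-7])`; at that point the lookahead is no longer strictly needed, because "do_sul" cannot match the year.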

Answer 2

I think what you are really after is something more like this:

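import re
from collections import defaultdict

# filenames: an iterable of file name strings, assumed to exist already
regions_to_files = defaultdict(list)
for x in filenames:
    matches = re.match(r'(?P<region>.*)_(?P<year>200[0-9]|201[0-7])_test\.csv', x)
    if matches:  # skip names that do not follow the pattern
        regions_to_files[matches.group('region')].append(x)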

Now all files for mato_grosso are available under regions_to_files['mato_grosso'], and all files for mato_grosso_do_sul are available under regions_to_files['mato_grosso_do_sul'].
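To see the grouping in action, here is a small self-contained sketch. The sample `filenames` list (including the unrelated `notes.txt`) is made up for illustration, and `fullmatch` is used here instead of `match` so that trailing characters after `.csv` are rejected as well.

```python
import re
from collections import defaultdict

# Hypothetical sample names, for illustration only.
filenames = [
    'mato_grosso_2000_test.csv',
    'mato_grosso_2001_test.csv',
    'mato_grosso_do_sul_2000_test.csv',
    'notes.txt',
]

# The greedy region group backtracks until "_<year>_test.csv" fits at the end.
pattern = re.compile(r'(?P<region>.*)_(?P<year>200[0-9]|201[0-7])_test\.csv')

regions_to_files = defaultdict(list)
for name in filenames:
    match = pattern.fullmatch(name)
    if match is None:
        continue  # name does not follow the <region>_<year>_test.csv layout
    regions_to_files[match.group('region')].append(name)

print(regions_to_files['mato_grosso'])
print(regions_to_files['mato_grosso_do_sul'])
```

Because the region is captured rather than hard-coded, the same loop should handle any other region prefix without changes.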

To match the first file name:

# mato_grosso_2000_test.csv
re.match(r'mato_grosso_20(0[0-9]|1[0-7])_test\.csv', filename)

To match the second file name but not the first:

# mato_grosso_do_sul_2000_test.csv
re.match(r'mato_grosso_do_sul_20(0[0-9]|1[0-7])_test\.csv', filename)

The regular expression (0[0-9]|1[0-7]) matches 00, 01, ..., 17 for you.
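One way to convince yourself of that range is to test every two-digit string against the sub-pattern. This is only a quick check; the `year_suffix` and `accepted` names are made up for this sketch.

```python
import re

# The two-digit part of the year, taken from the patterns above.
year_suffix = re.compile(r'0[0-9]|1[0-7]')

# Collect every two-digit string from 00 to 99 that the pattern accepts in full.
accepted = [f'{n:02d}' for n in range(100) if year_suffix.fullmatch(f'{n:02d}')]
print(accepted)
```

The list should contain exactly the eighteen values from 00 through 17, so 18 and 19 are rejected.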

Distinguishing two strings with a regular expression in Python

2 answers: