Question

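I have a program in Perl that uses a regular expression to store the supported file extensions, and it reuses this regex throughout the code. Each file extension has a description next to it, which is possible because the regex has the 'x' flag. I can't figure out how to port this to Python (2.7).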

Original Perl

use strict;

my @files = ('foo.abc','foo.ABC','foo.mi','foo.txt','foo.ma','foo.iff','foo.avi');

my $exts = qr/abc|mi|avi|ma|iff|tga/;

foreach my $f (sort @files) {
    if ($f =~ m/^([^.]+\.$exts)/) {
        print "file matches: $f\n";
    } 
    else {
        print "file does not match: $f\n";
    }
}

Output

file does not match: foo.ABC
file matches: foo.abc
file matches: foo.avi
file matches: foo.iff
file matches: foo.ma
file matches: foo.mi
file does not match: foo.txt

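This works just as well when I use the /x modifier to add whitespace and comments (the /i modifier is also added here, so foo.ABC now matches):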

$exts = qr/
    abc  (?# alembic )
    |mi  (?# mentalray )
    |avi (?# windows video )
    |ma  (?# maya ascii )
    |iff (?# amiga bitmap )
    |tga (?# targa bitmap )
/ix;

foreach my $f (sort @files) {

    if ( $f =~ m/^([^.]+\.$exts )/ ) {
        print "file matches: $f\n";
    }
    else {
        print "file does not match: $f\n";
    }
}

Output

file matches: foo.ABC
file matches: foo.abc
file matches: foo.avi
file matches: foo.iff
file matches: foo.ma
file matches: foo.mi
file does not match: foo.txt

Python supports compiled regular expressions, and you can more or less use them as components of other regular expressions.

Python

import re

files = [ 'foo.abc','foo.ABC','foo.mi','foo.txt','foo.ma','foo.iff','foo.avi' ]

exts = re.compile(r'(?:abc|mi|avi|ma|iff|tga)')

for f in sorted(files):
    m = re.search(r'^([^.]+\.{EXTS})'.format(EXTS=exts.pattern),f)
    if m:
        print 'file matches: {0}'.format(f)
    else:
        print 'file does not match: {0}'.format(f)

Output

file does not match: foo.ABC
file matches: foo.abc
file matches: foo.avi
file matches: foo.iff
file matches: foo.ma
file matches: foo.mi
file does not match: foo.txt

But as soon as I use re.VERBOSE, the regex fails:

exts = re.compile(r'''(?:
                     abc   # alembic
                    |mi    # mentalray
                    |avi   # windows video
                    |ma    # maya ascii
                    |iff   # amiga bitmap
                    |tga   # targa bitmap
                    )''', re.IGNORECASE + re.VERBOSE)

for f in sorted(files):
    m = re.search(r'^([^.]+\.{EXTS})'.format(EXTS=exts.pattern),f)
    if m:
        print 'file matches: {0}'.format(f)
    else:
        print 'file does not match: {0}'.format(f)

Output

file does not match: foo.ABC
file does not match: foo.abc
file does not match: foo.avi
file does not match: foo.iff
file does not match: foo.ma
file does not match: foo.mi
file does not match: foo.txt

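My actual code has more than 50 extensions, each with a comment describing what it is, so I really want to support this.

I searched every "nested regex" post I could find, but all of them are string hacks. I could not find any actual regex nesting.

Can Python do this?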

Answer 1

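You are doing this completely wrong. First of all, the .pattern attribute is just a string. So it is 100% useless to call re.compile and then extract the initial string that was used to build the regex object, only to pass that string to re.search: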

>>> regex = re.compile(r'''(
...     verbose #lol
...     | pattern  #rofl
... )
... ''', re.VERBOSE)
>>> regex.match('verbose')  # finds the match!
<_sre.SRE_Match object; span=(0, 7), match='verbose'>
>>> re.search(regex.pattern, 'verbose')  # does not find the match!
>>>

As you can see, the pattern attribute is just the initial string used to build the regex object:

>>> regex.pattern
'(\n    verbose #lol\n    | pattern  #rofl\n)\n'
>>> type(regex.pattern)
<class 'str'>

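So by passing it to re.search you make re.search recompile it, and since that call does not carry the re.VERBOSE flag, the string is compiled with a different meaning. Passing the flag again makes it match: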

>>> re.search(regex.pattern, 'verbose', re.VERBOSE)
<_sre.SRE_Match object; span=(0, 7), match='verbose'>
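
To tie this back to the question's code, here is a minimal sketch (written for Python 3, shortened to two extensions) of why the formatted pattern fails and how reusing the flags stored on the compiled object changes the result. The variable names are only illustrative.

```python
import re

exts = re.compile(r'''(?:
      abc   # alembic
    | mi    # mentalray
    )''', re.IGNORECASE | re.VERBOSE)

# exts.pattern is the raw source string; the flags are stored separately.
combined = r'^([^.]+\.{0})'.format(exts.pattern)

# Without flags, the whitespace and "# ..." text are matched literally.
no_flags = re.search(combined, 'foo.abc')

# Passing the flags kept on the compiled object restores the intended meaning.
with_flags = re.search(combined, 'foo.ABC', exts.flags)

print(no_flags)
print(with_flags is not None)
```

Note that the flags then apply to the whole combined pattern, so any literal whitespace or # in the outer part would have to be escaped.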

Also, I would do it like this:

import re

exts = [
    'abc',   # extension abc blah blah
    'cde',   # extension cde blah blah
]
exts_pattern = '(?:{})'.format('|'.join(re.escape(extension) for extension in exts))

# The capture group opened after ^ must be closed after the placeholder.
regex = re.compile(r'^([^.]+\.{})'.format(exts_pattern), re.IGNORECASE)

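Or something similar. That is, you keep the various extensions in a list, attach whatever Python comments you want, and iterate over the list when you build the regex object with compile. This makes it easier to add extensions, and a list like this can be useful elsewhere anyway.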
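
As a hedged end-to-end sketch of the list approach, applied to the file names from the question (written for Python 3): storing the descriptions as data rather than comments is an assumption of this example, not something the approach requires, and the names EXTENSIONS and file_regex are hypothetical.

```python
import re

# (extension, description) pairs; the descriptions are kept as plain data.
EXTENSIONS = [
    ('abc', 'alembic'),
    ('mi',  'mentalray'),
    ('avi', 'windows video'),
    ('ma',  'maya ascii'),
    ('iff', 'amiga bitmap'),
    ('tga', 'targa bitmap'),
]

# Escape each extension and join them into one non-capturing alternation.
exts_pattern = '(?:{0})'.format(
    '|'.join(re.escape(ext) for ext, _description in EXTENSIONS))
file_regex = re.compile(r'^([^.]+\.{0})'.format(exts_pattern), re.IGNORECASE)

files = ['foo.abc', 'foo.ABC', 'foo.mi', 'foo.txt', 'foo.ma', 'foo.iff', 'foo.avi']
matched = [f for f in sorted(files) if file_regex.search(f)]
print(matched)
```

No verbose flag is involved here, so the resulting pattern string can be formatted into other patterns without changing its meaning.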

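And to answer your last question: no, the Python re module does not support "regex nesting" in any way. You must provide a string pattern, which is then compiled into a regex object.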

Answer 2

It is a myth that Perl inserts one compiled regex into another. If you write this

my $exts = qr/ abc | mi | avi | ma | iff | tga /x;
if ( $f =~ /^([^.]+\.$exts)/ ) { ... }

then the contents of the regex pattern are evaluated in a double-quote context. That means Perl stringifies $exts to

(?^x: abc | mi | avi | ma | iff | tga )

(the exact result depends on which Perl pragmas are in place) and compiles the pattern from the resulting string after interpolation.

So the regex match is actually doing this

$f =~ /^([^.]+\.(?^x: abc | mi | avi | ma | iff | tga ))/

which is clearly correct, because the /x modifier is enabled within the subexpression.

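The only difference in Python is that what the pattern or __str__ methods of an re object return is not so careful, so the result cannot be safely injected as a substring into other patterns.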

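As far as I know, the pattern method just returns the original regex string that was compiled to create the object. That makes it more like using C preprocessor macros: you have to be very careful with parentheses, both in the original definition and in its invocation.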
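
To illustrate the comparison with Perl, here is a minimal sketch of a helper that imitates Perl's stringification by wrapping a compiled pattern in a scoped inline-flag group. It assumes Python 3.6 or later, where the (?ix:...) syntax is available (so it does not apply to the 2.7 interpreter in the question). The name embeddable is hypothetical and not part of the re module.

```python
import re

def embeddable(compiled):
    """Hypothetical helper: wrap a compiled pattern's source in a
    scoped-flag group so it keeps its meaning inside another pattern."""
    letters = ''
    for flag, letter in ((re.IGNORECASE, 'i'), (re.MULTILINE, 'm'),
                         (re.DOTALL, 's'), (re.VERBOSE, 'x')):
        if compiled.flags & flag:
            letters += letter
    # A newline before ")" stops a trailing verbose comment from swallowing it.
    tail = '\n)' if compiled.flags & re.VERBOSE else ')'
    return '(?{0}:{1}{2}'.format(letters, compiled.pattern, tail)

exts = re.compile(r'''
      abc   # alembic
    | mi    # mentalray
    | tga   # targa bitmap''', re.IGNORECASE | re.VERBOSE)

# Only the embedded group is verbose and case-insensitive.
file_regex = re.compile(r'^([^.]+\.{0})'.format(embeddable(exts)))

print(bool(file_regex.search('foo.ABC')))
print(bool(file_regex.search('foo.txt')))
```

Because the flags are scoped to the group, the outer pattern keeps its ordinary, non-verbose meaning, which is close to what Perl's (?^x: ... ) form does.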

Answer 3

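Yes, it can! In the Python documentation for re, I found that you can specify any of the re flags inside the expression itself, similar to how Perl prints the flags inline when it stringifies a regex. Adding that to the string hack gives the desired result: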

exts = '''(?ix)(?:
                   abc   # alembic
                  |mi    # mentalray
                  |avi   # windows video
                  |ma    # maya ascii
                  |iff   # amiga bitmap
                  |tga   # targa bitmap
               )'''

for f in sorted(files):
   m = re.search(r'^([^.]+\.{EXTS})'.format(EXTS=exts),f)
   if m:
      print 'file matches: {0}'.format(f)
   else:
      print 'file does not match: {0}'.format(f)

(?ix) is non-grouping, but it sets re.IGNORECASE and re.VERBOSE.
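
One caveat for interpreters newer than the 2.7 in the question: a global inline flag such as (?ix) in the middle of a pattern has been deprecated since Python 3.6 and is rejected from Python 3.11 on. A minimal sketch of the same idea with the flags moved to the very start of the combined pattern (written for Python 3, variable names illustrative):

```python
import re

exts = r'''(?:
      abc   # alembic
    | mi    # mentalray
    | avi   # windows video
    | ma    # maya ascii
    | iff   # amiga bitmap
    | tga   # targa bitmap
    )'''

# Global flags come first, so they apply to the whole pattern.
file_regex = re.compile(r'(?ix)^([^.]+\.{EXTS})'.format(EXTS=exts))

files = ['foo.abc', 'foo.ABC', 'foo.mi', 'foo.txt', 'foo.ma', 'foo.iff', 'foo.avi']
results = [(f, bool(file_regex.search(f))) for f in sorted(files)]
for name, ok in results:
    print('file {0}: {1}'.format('matches' if ok else 'does not match', name))
```

Since the verbose flag now covers the outer pattern too, any literal whitespace or # in it would need to be escaped.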
