你如何将这个正则表达式的习惯用法从Perl翻译成Python?

时间:2008-09-23 16:55:19

标签: python regex perl

我大约一年前从Perl切换到Python,并没有回头。我发现只有一个成语在Perl中比在Python中更容易做到:

if ($var =~ /foo(.+)/) {
  # do something with $1
} elsif ($var =~ /bar(.+)/) {
  # do something with $1
} elsif ($var =~ /baz(.+)/) {
  # do something with $1
}

相应的Python代码并不那么优雅,因为if语句不断嵌套:

m = re.search(r'foo(.+)', var)
if m:
  # do something with m.group(1)
else:
  m = re.search(r'bar(.+)', var)
  if m:
    # do something with m.group(1)
  else:
    m = re.search(r'baz(.+)', var)
    if m:
      # do something with m.group(2)

有没有人有一种优雅的方式在Python中重现这种模式?我已经看过使用过的匿名函数调度表,但对于少数正则表达式来说,这些对我来说似乎有点笨拙......

15 个答案:

答案 0 :(得分:18)

使用命名组和调度表:

r = re.compile(r'(?P<cmd>foo|bar|baz)(?P<data>.+)')

def do_foo(data):
    ...

def do_bar(data):
    ...

def do_baz(data):
    ...

dispatch = {
    'foo': do_foo,
    'bar': do_bar,
    'baz': do_baz,
}


m = r.match(var)
if m:
    dispatch[m.group('cmd')](m.group('data'))

通过一些内省,您可以自动生成正则表达式和调度表。

答案 1 :(得分:10)

是的,这有点烦人。也许这适用于您的情况。


import re

class ReCheck(object):
    def __init__(self):
        self.result = None
    def check(self, pattern, text):
        self.result = re.search(pattern, text)
        return self.result

var = 'bar stuff'
m = ReCheck()
if m.check(r'foo(.+)',var):
    print m.result.group(1)
elif m.check(r'bar(.+)',var):
    print m.result.group(1)
elif m.check(r'baz(.+)',var):
    print m.result.group(1)

编辑: Brian正确指出我的第一次尝试无效。不幸的是,这种尝试时间更长。

答案 2 :(得分:10)

r"""
This is an extension of the re module. It stores the last successful
match object and lets you access it's methods and attributes via
this module.

This module exports the following additional functions:
    expand  Return the string obtained by doing backslash substitution on a
            template string.
    group   Returns one or more subgroups of the match.
    groups  Return a tuple containing all the subgroups of the match.
    start   Return the indices of the start of the substring matched by
            group.
    end     Return the indices of the end of the substring matched by group.
    span    Returns a 2-tuple of (start(), end()) of the substring matched
            by group.

This module defines the following additional public attributes:
    pos         The value of pos which was passed to the search() or match()
                method.
    endpos      The value of endpos which was passed to the search() or
                match() method.
    lastindex   The integer index of the last matched capturing group.
    lastgroup   The name of the last matched capturing group.
    re          The regular expression object which as passed to search() or
                match().
    string      The string passed to match() or search().
"""

import re as re_

from re import *
from functools import wraps

__all__ = re_.__all__ + [ "expand", "group", "groups", "start", "end", "span",
        "last_match", "pos", "endpos", "lastindex", "lastgroup", "re", "string" ]

last_match = pos = endpos = lastindex = lastgroup = re = string = None

def _set_match(match=None):
    global last_match, pos, endpos, lastindex, lastgroup, re, string
    if match is not None:
        last_match = match
        pos = match.pos
        endpos = match.endpos
        lastindex = match.lastindex
        lastgroup = match.lastgroup
        re = match.re
        string = match.string
    return match

@wraps(re_.match)
def match(pattern, string, flags=0):
    return _set_match(re_.match(pattern, string, flags))


@wraps(re_.search)
def search(pattern, string, flags=0):
    return _set_match(re_.search(pattern, string, flags))

@wraps(re_.findall)
def findall(pattern, string, flags=0):
    matches = re_.findall(pattern, string, flags)
    if matches:
        _set_match(matches[-1])
    return matches

@wraps(re_.finditer)
def finditer(pattern, string, flags=0):
    for match in re_.finditer(pattern, string, flags):
        yield _set_match(match)

def expand(template):
    if last_match is None:
        raise TypeError, "No successful match yet."
    return last_match.expand(template)

def group(*indices):
    if last_match is None:
        raise TypeError, "No successful match yet."
    return last_match.group(*indices)

def groups(default=None):
    if last_match is None:
        raise TypeError, "No successful match yet."
    return last_match.groups(default)

def groupdict(default=None):
    if last_match is None:
        raise TypeError, "No successful match yet."
    return last_match.groupdict(default)

def start(group=0):
    if last_match is None:
        raise TypeError, "No successful match yet."
    return last_match.start(group)

def end(group=0):
    if last_match is None:
        raise TypeError, "No successful match yet."
    return last_match.end(group)

def span(group=0):
    if last_match is None:
        raise TypeError, "No successful match yet."
    return last_match.span(group)

del wraps  # Not needed past module compilation

例如:

if gre.match("foo(.+)", var):
  # do something with gre.group(1)
elif gre.match("bar(.+)", var):
  # do something with gre.group(1)
elif gre.match("baz(.+)", var):
  # do something with gre.group(1)

答案 3 :(得分:9)

我建议这样做,因为它使用最少的正则表达式来实现你的目标。它仍然是功能代码,但不会比旧的Perl差。

import re
var = "barbazfoo"

m = re.search(r'(foo|bar|baz)(.+)', var)
if m.group(1) == 'foo':
    print m.group(1)
    # do something with m.group(1)
elif m.group(1) == "bar":
    print m.group(1)
    # do something with m.group(1)
elif m.group(1) == "baz":
    print m.group(2)
    # do something with m.group(2)

答案 4 :(得分:6)

感谢this other SO question

import re

class DataHolder:
    def __init__(self, value=None, attr_name='value'):
        self._attr_name = attr_name
        self.set(value)
    def __call__(self, value):
        return self.set(value)
    def set(self, value):
        setattr(self, self._attr_name, value)
        return value
    def get(self):
        return getattr(self, self._attr_name)

string = u'test bar 123'
save_match = DataHolder(attr_name='match')
if save_match(re.search('foo (\d+)', string)):
    print "Foo"
    print save_match.match.group(1)
elif save_match(re.search('bar (\d+)', string)):
    print "Bar"
    print save_match.match.group(1)
elif save_match(re.search('baz (\d+)', string)):
    print "Baz"
    print save_match.match.group(1)

答案 5 :(得分:4)

或者,根本不使用正则表达式:

prefix, data = var[:3], var[3:]
if prefix == 'foo':
    # do something with data
elif prefix == 'bar':
    # do something with data
elif prefix == 'baz':
    # do something with data
else:
    # do something with var

这是否合适取决于您的实际问题。不要忘记,正则表达式不是他们在Perl中的瑞士军刀; Python有不同的构造来进行字符串操作。

答案 6 :(得分:4)

def find_first_match(string, *regexes):
    for regex, handler in regexes:
        m = re.search(regex, string):
        if m:
            handler(m)
            return
    else:
        raise ValueError

find_first_match(
    foo, 
    (r'foo(.+)', handle_foo), 
    (r'bar(.+)', handle_bar), 
    (r'baz(.+)', handle_baz))

为了加快速度,可以在内部将所有正则表达式转换为一个并立即创建调度程序。理想情况下,这将变成一个类。

答案 7 :(得分:3)

以下是解决此问题的方法:

matched = False;

m = re.match("regex1");
if not matched and m:
    #do something
    matched = True;

m = re.match("regex2");
if not matched and m:
    #do something else
    matched = True;

m = re.match("regex3");
if not matched and m:
    #do yet something else
    matched = True;

不像原始图案那么干净。但是,它简单,直接,不需要额外的模块或更改原始正则表达式。

答案 8 :(得分:2)

Python 3.8开始并引入assignment expressions (PEP 572):=运算符),我们现在可以按顺序捕获变量re.search(pattern, text)中的条件值match都要检查它是否不是None,然后在条件主体内重新使用它:

if match := re.search(r'foo(.+)', text):
  # do something with match.group(1)
elif match := re.search(r'bar(.+)', text):
  # do something with match.group(1)
elif match := re.search(r'baz(.+)', text)
  # do something with match.group(2)

答案 9 :(得分:1)

如何使用字典?

match_objects = {}

if match_objects.setdefault( 'mo_foo', re_foo.search( text ) ):
  # do something with match_objects[ 'mo_foo' ]

elif match_objects.setdefault( 'mo_bar', re_bar.search( text ) ):
  # do something with match_objects[ 'mo_bar' ]

elif match_objects.setdefault( 'mo_baz', re_baz.search( text ) ):
  # do something with match_objects[ 'mo_baz' ]

...

但是,您必须确保没有重复的match_objects字典键(mo_foo,mo_bar,...),最好通过为每个正则表达式指定自己的名称并相应地命名match_objects键,否则match_objects.setdefault()方法将返回现有的匹配对象,而不是通过运行re_xxx.search(text)来创建新的匹配对象。

答案 10 :(得分:1)

稍微扩展Pat Notz的解决方案,我发现它甚至更优雅:
- 将方法命名为re提供的方法(例如search()check())和
- 在持有者对象本身上实现必要的方法,如group()

class Re(object):
    def __init__(self):
        self.result = None

    def search(self, pattern, text):
        self.result = re.search(pattern, text)
        return self.result

    def group(self, index):
        return self.result.group(index)

实施例

而不是例如这样:

m = re.search(r'set ([^ ]+) to ([^ ]+)', line)
if m:
    vars[m.group(1)] = m.group(2)
else:
    m = re.search(r'print ([^ ]+)', line)
    if m:
        print(vars[m.group(1)])
    else:
        m = re.search(r'add ([^ ]+) to ([^ ]+)', line)
        if m:
            vars[m.group(2)] += vars[m.group(1)]

一个人这样做:

m = Re()
...
if m.search(r'set ([^ ]+) to ([^ ]+)', line):
    vars[m.group(1)] = m.group(2)
elif m.search(r'print ([^ ]+)', line):
    print(vars[m.group(1)])
elif m.search(r'add ([^ ]+) to ([^ ]+)', line):
    vars[m.group(2)] += vars[m.group(1)]

最终看起来非常自然,从Perl移动时不需要太多的代码更改,并像其他解决方案一样避免了全局状态的问题。

答案 11 :(得分:1)

极简主义DataHolder:

class Holder(object):
    def __call__(self, *x):
        if x:
            self.x = x[0]
        return self.x

data = Holder()

if data(re.search('foo (\d+)', string)):
    print data().group(1)

或作为单身功能:

def data(*x):
    if x:
        data.x = x[0]
    return data.x

答案 12 :(得分:0)

我的解决方案是:

import re

class Found(Exception): pass

try:        
    for m in re.finditer('bar(.+)', var):
        # Do something
        raise Found

    for m in re.finditer('foo(.+)', var):
        # Do something else
        raise Found

except Found: pass

答案 13 :(得分:0)

这是一个RegexDispatcher类,它通过正则表达式调度其子类方法。

每个可调度方法都使用正则表达式进行注释,例如

def plus(self, regex: r"\+", **kwargs):
...

在这种情况下,注释称为“正则表达式”,其值是要匹配的正则表达式,'\ +',即+符号。这些带注释的方法放在子类中,而不是放在基类中。

当在字符串上调用dispatch(...)方法时,该类会找到带有与字符串匹配的注释正则表达式并调用它的方法。这是班级:

import inspect
import re


class RegexMethod:
    def __init__(self, method, annotation):
        self.method = method
        self.name = self.method.__name__
        self.order = inspect.getsourcelines(self.method)[1] # The line in the source file
        self.regex = self.method.__annotations__[annotation]

    def match(self, s):
        return re.match(self.regex, s)

    # Make it callable
    def __call__(self, *args, **kwargs):
        return self.method(*args, **kwargs)

    def __str__(self):
        return str.format("Line: %s, method name: %s, regex: %s" % (self.order, self.name, self.regex))


class RegexDispatcher:
    def __init__(self, annotation="regex"):
        self.annotation = annotation
        # Collect all the methods that have an annotation that matches self.annotation
        # For example, methods that have the annotation "regex", which is the default
        self.dispatchMethods = [RegexMethod(m[1], self.annotation) for m in
                                inspect.getmembers(self, predicate=inspect.ismethod) if
                                (self.annotation in m[1].__annotations__)]
        # Be sure to process the dispatch methods in the order they appear in the class!
        # This is because the order in which you test regexes is important.
        # The most specific patterns must always be tested BEFORE more general ones
        # otherwise they will never match.
        self.dispatchMethods.sort(key=lambda m: m.order)

    # Finds the FIRST match of s against a RegexMethod in dispatchMethods, calls the RegexMethod and returns
    def dispatch(self, s, **kwargs):
        for m in self.dispatchMethods:
            if m.match(s):
                return m(self.annotation, **kwargs)
        return None

要使用此类,请将其子类化以创建带有带注释方法的类。举例来说,这是一个简单的RPNCalculator,它继承自RegexDispatcher。要调度的方法(当然)是带有“正则表达式”注释的方法。父调度()方法在调用

中调用
from RegexDispatcher import *
import math

class RPNCalculator(RegexDispatcher):
    def __init__(self):
        RegexDispatcher.__init__(self)
        self.stack = []

    def __str__(self):
        return str(self.stack)

    # Make RPNCalculator objects callable
    def __call__(self, expression):
        # Calculate the value of expression
        for t in expression.split():
            self.dispatch(t, token=t)
        return self.top()  # return the top of the stack

    # Stack management
    def top(self):
        return self.stack[-1] if len(self.stack) > 0 else []

    def push(self, x):
        return self.stack.append(float(x))

    def pop(self, n=1):
        return self.stack.pop() if n == 1 else [self.stack.pop() for n in range(n)]

    # Handle numbers
    def number(self, regex: r"[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", **kwargs):
        self.stack.append(float(kwargs['token']))

    # Binary operators
    def plus(self, regex: r"\+", **kwargs):
        a, b = self.pop(2)
        self.push(b + a)

    def minus(self, regex: r"\-", **kwargs):
        a, b = self.pop(2)
        self.push(b - a)

    def multiply(self, regex: r"\*", **kwargs):
        a, b = self.pop(2)
        self.push(b * a)

    def divide(self, regex: r"\/", **kwargs):
        a, b = self.pop(2)
        self.push(b / a)

    def pow(self, regex: r"exp", **kwargs):
        a, b = self.pop(2)
        self.push(a ** b)

    def logN(self, regex: r"logN", **kwargs):
        a, b = self.pop(2)
        self.push(math.log(a,b))

    # Unary operators
    def neg(self, regex: r"neg", **kwargs):
        self.push(-self.pop())

    def sqrt(self, regex: r"sqrt", **kwargs):
        self.push(math.sqrt(self.pop()))

    def log2(self, regex: r"log2", **kwargs):
        self.push(math.log2(self.pop()))

    def log10(self, regex: r"log10", **kwargs):
        self.push(math.log10(self.pop()))

    def pi(self, regex: r"pi", **kwargs):
        self.push(math.pi)

    def e(self, regex: r"e", **kwargs):
        self.push(math.e)

    def deg(self, regex: r"deg", **kwargs):
        self.push(math.degrees(self.pop()))

    def rad(self, regex: r"rad", **kwargs):
        self.push(math.radians(self.pop()))

    # Whole stack operators
    def cls(self, regex: r"c", **kwargs):
        self.stack=[]

    def sum(self, regex: r"sum", **kwargs):
        self.stack=[math.fsum(self.stack)]


if __name__ == '__main__':
    calc = RPNCalculator()

    print(calc('2 2 exp 3 + neg'))

    print(calc('c 1 2 3 4 5 sum 2 * 2 / pi'))

    print(calc('pi 2 * deg'))

    print(calc('2 2 logN'))

我喜欢这个解决方案,因为没有单独的查找表。要匹配的正则表达式嵌入在要作为注释调用的方法中。对我来说,这是应该的。如果Python允许更灵活的注释会更好,因为我宁愿将正则表达式注释放在方法本身而不是将其嵌入方法参数列表中。但是,目前这是不可能的。

有兴趣的话,看一下Wolfram语言,其中函数是任意模式的多态,而不仅仅是参数类型。在正则表达式上具有多态性的函数是一个非常强大的想法,但我们无法在Python中完全实现。 RegexDispatcher类是我能做的最好的。

答案 14 :(得分:0)

import re

s = '1.23 Million equals to 1230000'

s = re.sub("([\d.]+)(\s*)Million", lambda m: str(round(float(m.groups()[0]) * 1000_000))+m.groups()[1], s)

print(s)

1230000等于1230000