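RegEx: replace text unless it is between quotes

Time: 2018-04-04 00:50:57

Tags: python regex

I'm working on a transpiler and want to replace my language's tokens with Python's. The replacement is done like this: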

for rep in reps:
    pattern, translated = rep

    # Replaces every [pattern] with [translated] in [transpiled]
    transpiled = re.sub(pattern, translated, transpiled, flags=re.UNICODE)

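where reps is a list of ordered pairs (regex to be replaced, string to replace it with) and transpiled is the text to be transpiled.

However, I can't seem to find a way to exclude text between quotes from the replacement. Note that this is for a language, so it also has to handle escaped quotes and single quotes.

2 Answers:

Answer 0: (score: 1)

This may depend on how you define your patterns, but in general you can surround pattern with lookbehind and lookahead groups to make sure that text directly enclosed in quotes is not matched: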

import re

transpiled = "A foo with \"foo\" and single quoted 'foo'. It even has an escaped \\'foo\\'!"

reps = [("foo", "bar"), ("and", "or")]

print(transpiled)  # before the changes

for rep in reps:
    pattern, translated = rep
    transpiled = re.sub("(?<=[^\"']){}(?=\\\\?[^\"'])".format(pattern),
                        translated, transpiled, flags=re.UNICODE)
    print(transpiled)  # after each change

This produces:

A foo with "foo" and single quoted 'foo'. It even has an escaped \'foo\'!
A bar with "foo" and single quoted 'foo'. It even has an escaped \'foo\'!
A bar with "foo" or single quoted 'foo'. It even has an escaped \'foo\'!
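The lookarounds only inspect the characters immediately next to each match, which a small experiment makes visible. This is a sketch, not part of the original answer: the helper name `adjacent_quote_sub` is made up here, while the wrapper pattern is the one from the loop above.

```python
import re

def adjacent_quote_sub(pattern, translated, text):
    # Hypothetical helper: wraps the pattern in the same lookarounds as the loop above
    wrapped = "(?<=[^\"']){}(?=\\\\?[^\"'])".format(pattern)
    return re.sub(wrapped, translated, text, flags=re.UNICODE)

# A word inside a longer quoted phrase has no quote right next to it,
# so it is still replaced
print(adjacent_quote_sub("foo", "bar", "say 'a foo here'"))
# say 'a bar here'

# The lookbehind needs one preceding character, so a match at the very
# start of the text is skipped
print(adjacent_quote_sub("foo", "bar", "foo first"))
# foo first
```

Both effects follow from the pattern checking single neighbouring characters rather than whole quoted regions.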

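Update: if you want to ignore entire quoted regions of text, not just quoted words, you will have to do a bit more work. You could do it by looking for repeated quotes, but the whole lookahead/lookbehind mechanism would get very messy and probably far from optimal. It is much easier to separate the quoted text from the non-quoted text and do the replacements only in the non-quoted parts, something like: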

import re

QUOTED_STRING = re.compile("(\\\\?[\"']).*?\\1")  # a pattern to match strings between quotes

def replace_multiple(source, replacements, flags=0):  # a convenience replacement function
    if not source:  # no need to process empty strings
        return ""
    for r in replacements:
        source = re.sub(r[0], r[1], source, flags=flags)
    return source

def replace_non_quoted(source, replacements, flags=0):
    result = []  # a store for the result pieces
    head = 0  # a search head reference
    for match in QUOTED_STRING.finditer(source):
        # process everything until the current quoted match and add it to the result
        result.append(replace_multiple(source[head:match.start()], replacements, flags))
        result.append(match[0])  # add the quoted match verbatim to the result
        head = match.end()  # move the search head to the end of the quoted match
    if head < len(source):  # if the search head is not at the end of the string
        # process the rest of the string and add it to the result
        result.append(replace_multiple(source[head:], replacements, flags))
    return "".join(result)  # join back the result pieces and return them

You can test it as:

print(replace_non_quoted("A foo with \"foo\" and 'foo', says: 'I have a foo'!", reps))
# A bar with "foo" or 'foo', says: 'I have a foo'!
print(replace_non_quoted("A foo with \"foo\" and foo, says: \\'I have a foo\\'!", reps))
# A bar with "foo" or bar, says: \'I have a foo\'!
print(replace_non_quoted("A foo with '\"foo\" and foo', says - I have a foo!", reps))
# A bar with '"foo" and foo', says - I have a bar!

As an added bonus, this also lets you use full-fledged regex patterns as replacements:

print(replace_non_quoted("My foo and \"bar\" are like 'moo' and star!",
                        ((r"(\w+)oo", r"oo\1"), (r"(\w+)ar", r"ra\1"))))
# My oof and "bar" are like 'moo' and rast!
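A different technique that is sometimes used for the same goal is a single `re.sub()` call with an alternation and a callback: the quoted string is matched first and returned unchanged, and only matches of the second alternative are translated. The sketch below is not the answer's method. The names `QUOTED` and `sub_outside_quotes` are hypothetical, the pattern is assumed to contain no backreferences of its own, and backslash escapes are assumed to occur only inside quotes (unlike `QUOTED_STRING` above, which also accepts an escaped quote as a delimiter).

```python
import re

# Assumed shape of a quoted string: an opening quote, then escaped
# characters or anything that is not the same quote, then the closing quote
QUOTED = r"""(?P<quote>["'])(?:\\.|(?!(?P=quote)).)*(?P=quote)"""

def sub_outside_quotes(pattern, translated, text, flags=0):
    # Try the quoted alternative first so quoted regions are consumed whole
    combined = re.compile("(?P<quoted>{})|(?:{})".format(QUOTED, pattern), flags)

    def keep_or_translate(match):
        if match.group("quoted") is not None:
            return match.group("quoted")  # quoted region: return verbatim
        return translated  # plain match outside quotes: translate

    return combined.sub(keep_or_translate, text)

text = r"""A foo with "a \"quoted\" foo" and 'foo' and foo"""
print(sub_outside_quotes("foo", "bar", text))
# A bar with "a \"quoted\" foo" and 'foo' and bar
```

Because the replacement goes through a callback, `translated` is inserted literally here; group references such as `\1` in the replacement string would need extra handling.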

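But if your replacements don't involve patterns and only need simple substitution, you can swap re.sub() in the replace_multiple() helper function for the significantly faster native str.replace().

Finally, if you don't need complex patterns, you can get rid of regex entirely.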
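A regex-free variant could look roughly like the following. This is a sketch rather than the answer's own code: the function names are hypothetical, backslash escapes are deliberately not given special treatment, and an unterminated quote is assumed to protect the rest of the text.

```python
def replace_multiple_plain(source, replacements):
    # Plain substring substitution, applied in order; no regex involved
    for old, new in replacements:
        source = source.replace(old, new)
    return source

def replace_non_quoted_plain(source, replacements):
    result = []        # finished output pieces
    start = 0          # start of the piece currently being scanned
    open_quote = None  # the quote character we are inside, if any
    for i, ch in enumerate(source):
        if open_quote is None and ch in "\"'":
            # a quote opens: translate the unquoted piece before it
            result.append(replace_multiple_plain(source[start:i], replacements))
            start, open_quote = i, ch
        elif ch == open_quote:
            # the matching quote closes: copy the quoted piece verbatim
            result.append(source[start:i + 1])
            start, open_quote = i + 1, None
    tail = source[start:]
    # assumption: an unterminated quote leaves the tail untouched
    result.append(tail if open_quote else replace_multiple_plain(tail, replacements))
    return "".join(result)

reps = [("foo", "bar"), ("and", "or")]
print(replace_non_quoted_plain("A foo with \"foo\" and 'foo', says: 'I have a foo'!", reps))
# A bar with "foo" or 'foo', says: 'I have a foo'!
```

Tracking only the currently open quote character means a double quote inside a single-quoted region (and vice versa) is treated as ordinary text.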

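Answer 1: (score: 0)

Rather than relying on regex alone, you may want to use Python's built-in shlex module. It is designed to handle quoted strings as found in a shell, including nested examples.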

import shlex
shlex.split("""look "nested \\"quotes\\"" here""")
# ['look', 'nested "quotes"', 'here']
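One possible way to apply this to the replacement problem, as a rough sketch: in non-POSIX mode shlex keeps the quote characters on quoted tokens, so any token that starts with a quote can be skipped. The function name `translate_tokens` and the exact-token mapping are assumptions made for illustration, and rejoining with single spaces discards the original whitespace.

```python
import shlex

def translate_tokens(text, mapping):
    # posix=False keeps the surrounding quotes on quoted tokens
    tokens = shlex.split(text, posix=False)
    out = []
    for token in tokens:
        if token[:1] in ("\"", "'"):
            out.append(token)  # quoted token: keep verbatim
        else:
            out.append(mapping.get(token, token))  # whole-token lookup
    return " ".join(out)

print(translate_tokens("A foo with \"a foo\" and 'foo'", {"foo": "bar", "and": "or"}))
# A bar with "a foo" or 'foo'
```

This only translates whole whitespace-separated tokens, so it suits keyword-style replacements rather than arbitrary regex patterns.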