Question

我有一个输入字符串，以某种方式提醒html代码，但是标记使用方括号而不是尖括号。

输入字符串为： str = r'Lorem ipsum [dolor] sit amet' 我使用模式使用re.split处理它 ptr = r'\[.*?\]' 检测标签。结果是一个列表 list = [r'Lorem ipsum ', r'[dolor]', r' sit amet'] 问题在于两个符号的序列\[应该表示[符号，而不是开始标记。

re.split(r'\[.*?\]', r'Lorem \[ipsum\] \\[dolor] sit amet') 给 [r'Lorem \', r'[ipsum\]', r' \\', r'[dolor]', r' sit amet'] 当我想得到 [r'Lorem ipsum \[dolor\] \\', r'[dolor]', r' sit amet']

所以我想了解的是如何重新拆分两个符号序列\x是转义序列，应将其视为单个符号吗？

Answer 1

将r放在字符串前面的目的是将其视为原始字符串-> Python将反斜杠视为文字字符。您应该只使用普通的字符串。

Answer 2

您可以使用

re.findall(r'(?:[^][\\]|\\.)+|\[[^][]*]', s)

请参见regex demo及其图形：

简介：

(?:[^][\\]|\\.)+-除]，[和\以外的任何字符的一个或多个序列或任何字符（但如果未指定{{ 1}}标志）以反斜杠转义
re.DOTALL-或
|-一个\[[^][]*]，然后是[和[以外的任何0+字符，然后是]。

Python demo：

输出：

import re
rx = r"(?:[^][\\]|\\.)+|\[[^][]*]"
s = r"Lorem \[ipsum\] \\[dolor] sit amet"
results = re.findall(r'(?:[^][\\]|\\.)+|\[[^][]*]', s)
for result in results:
    print("'{}'".format(result))

Answer 3

我猜测我们可能希望将字符串分成三部分，因为我们可以只使用三个捕获组并收集我们希望输出的内容，也许类似于：

(.+?\\\\)(\[.+\])(.+)

测试

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(.+?\\\\)(\[.+\])(.+)"

test_str = "Lorem \\[ipsum\\] \\\\[dolor] sit amet"

subst = ""

# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)

if result:
    print (result)

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Demo

const regex = /(.+?\\\\)(\[.+\])(.+)/gm;
const str = `Lorem \\[ipsum\\] \\\\[dolor] sit amet`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

Answer 4

谢谢您的建议。感谢他们，我找到了所需的解决方案。

要显示问题，请用[atag]代替标签<atag>。但是，序列\[应该保持不变，因为它是“转义的”。反斜杠也应“转义”，即\\保持不变，以便\\[转换为\\<。但是，\\\[仍然是\\\[。等等。

以下代码

import re

test_str = r"Lorem \[ipsum\] dolor [sit\]amet], consetetur [sadipscing\\] elitr, \\[sed] diam [nonumy]"

regex = r"(?P<prefix>[^\\](\\\\)*)\[(?P<expression>.*?[^\\](\\\\)*)\]"
subst = r"\g<prefix><\g<expression>>"

print (re.sub(regex, subst, test_str))

产生

Lorem \[ipsum\] dolor <sit\]amet>, consetetur <sadipscing\\> elitr, \\<sed> diam <nonumy>

如愿。

希望这会对某人有所帮助。

Python转义re.split将反斜杠+符号视为另一个符号

4 个答案:

测试

Demo