Question

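I am trying to escape double quotes inside a string so that it can be loaded with json.loads. The code below is my attempt at working out how to do this correctly.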

import re

one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'

print(re.sub(r'("caption":".*?)"(.*?",")', r'\1\"\2', one))
print(re.sub(r'("caption":".*?)"(.*?",")', r'\1\"\2', two))

This is the current output.

"caption":"This caption should not match nor have any double quotes escaped","
"caption":"This caption \"should have "the duobles quotes" in the caption escaped"","

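The problem is that only the first double quote in the second string gets escaped. I realize there is a mistake in my regular expression; regex is not my strong suit. I have read a lot of posts here and spent a lot of time on Google, to no avail.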

Note that the actual string I am working with is about 10,000 characters long and contains many occurrences of both kinds of caption strings.

Answer 1

>>> import re
>>> one = '"caption":"This caption should not match nor have any double quotes escaped","'
>>> two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
>>> match = re.match(r"(\"caption\"\:\")(.*)(\",\")", two)
>>> midstr = match.group(2).replace('"', u'\u005C"')
>>> newstr = "".join([match.group(1), midstr, match.group(3)])
>>> print(newstr)
"caption":"This caption \"should have \"the duobles quotes\" in the caption escaped\"","

Answer 2

import re

two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'

expression = r"""
(             # Capturing group 1
[\w ]         # The quote should be preceded by a word char or space.
)             # End group

(")           # Capturing group 2: match a quote character.

(             # Capturing group 3
[^,:]         # Quote should not be followed by a comma or colon.
)             # End group
"""
pattern = re.compile(expression, re.VERBOSE)

# Keep groups 1 and 3 and put a backslash in front of the quote.
result = pattern.sub(r'\1\"\3', two)
print(result)

Demo (updated with a bug fix).

Answer 3

I would try re.sub, as follows:

import re

one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
result = re.sub(r"""(?<!^)(?<!:)(")(?!$)(?!:)""", r'\\\1', two)
print(result)

Output:

"caption":"This caption \"should have \"the duobles quotes\" in the caption escaped\"\","

LIVE DEMO

Regex explanation

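Match every quote that is not at the start or end of the line and is not immediately before or after a colon, then replace it with the same quote prefixed by a backslash (i.e. \").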
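To make the explanation easier to follow, here is the same pattern written in verbose mode, one condition per line. This is only a sketch for checking the behaviour on the two sample strings from the question; the names `pattern`, `escaped_one`, and `escaped_two` are illustrative.

```python
import re

one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'

pattern = re.compile(r'''
    (?<!^)   # not at the very start of the string
    (?<!:)   # not immediately after a colon
    (")      # the double quote itself (group 1)
    (?!$)    # not at the very end of the string
    (?!:)    # not immediately before a colon
''', re.VERBOSE)

# r'\\\1' is a literal backslash followed by the captured quote.
escaped_one = pattern.sub(r'\\\1', one)
escaped_two = pattern.sub(r'\\\1', two)

print(escaped_one)
print(escaped_two)
```

Note that a quote followed by a comma satisfies all four conditions, so the closing quote of each caption is escaped as well, including the one in `one`. Whether that is acceptable depends on what the surrounding text looks like.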

Answer 4

If you have the regex package installed (as mentioned in the comments), this should work:

import regex  # third-party package: pip install regex

result = regex.sub(r'(?<="caption":".*)"(?=.*",")', r'\"', subject)

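As you can see, the regex is the same as yours except that I changed the capturing groups to lookarounds. Because those parts of the string are no longer consumed by the match, there is no need to insert them back into the new string, so the replacement is just \".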

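I can't speak to the efficiency of this regex, because I know nothing about the surrounding text. If the target strings are each on their own line, it should be fine as long as you don't specify DOTALL mode. But the safest approach is to extract the strings first and process them individually.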
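One possible way to do the "extract first, then process" step with only the standard re module is to pass a function as the replacement. The sketch below assumes that every caption value ends at the first `","` that follows it, that captions stay on one line, and that no quotes are already escaped; `CAPTION` and `escape_caption_quotes` are hypothetical names, not part of any library.

```python
import re

# Assumption: a caption value runs from "caption":" up to the first '","'.
# The lookahead leaves the closing '","' unconsumed so adjacent keys still match.
CAPTION = re.compile(r'("caption":")(.*?)(?=",")')

def escape_caption_quotes(text):
    """Escape the double quotes inside every caption value found in text."""
    def fix(match):
        # group(1) is the '"caption":"' prefix, group(2) is the raw value.
        return match.group(1) + match.group(2).replace('"', '\\"')
    return CAPTION.sub(fix, text)

one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'

print(escape_caption_quotes(one))
print(escape_caption_quotes(two))
```

Because the callback's return value is inserted literally, there is no need to worry about backslash handling in a replacement template. If a caption can itself contain `","`, the delimiter assumption breaks and the pattern would need to be adapted.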

Answer 5

import re

# The fourth parameter is count (maximum number of replacements), not a position;
# with count=1 only the first occurrence of "so" is removed.
sent = 'we are having so so much of fun'
re.sub("so", '', sent, count=1)
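For comparison, a small sketch of how the `count` argument changes the result. When `count` is omitted it defaults to 0, which means every match is replaced. The variable names `first_only` and `all_matches` are illustrative.

```python
import re

sent = 'we are having so so much of fun'

# Replace only the first match.
first_only = re.sub('so', '', sent, count=1)

# Default count=0: replace every match.
all_matches = re.sub('so', '', sent)

print(repr(first_only))
print(repr(all_matches))
```

In the first result one "so" is still present; in the second both are gone, leaving only the surrounding spaces.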

Python re.sub only matches the first occurrence
