Python re.sub only matches the first occurrence

Date: 2015-11-22 04:51:54

Tags: python regex substitution

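I am trying to escape the double quotes inside a string so that it can be loaded with json.loads. The code below is my attempt to work out how to do this correctly.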

import re

one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'

print(re.sub('("caption":".*?)"(.*?",")', r'\1\"\2', one))
print(re.sub('("caption":".*?)"(.*?",")', r'\1\"\2', two))

This is the current output.

"caption":"This caption should not match nor have any double quotes escaped","
"caption":"This caption \"should have "the duobles quotes" in the caption escaped"","

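The problem is that only the first double quote in the second string gets escaped. I realize there is a mistake in my regular expression; regex is not my strong suit. I have read a lot of posts here and spent a lot of time on Google, to no avail.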

Note that the actual string I am working with is about 10,000 characters long and contains many occurrences of both kinds of caption string.

5 answers:

Answer 0 (score: 0)

>>> import re
>>> one = '"caption":"This caption should not match nor have any double quotes escaped","'
>>> two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
>>> match = re.match(r"(\"caption\"\:\")(.*)(\",\")", two)
>>> midstr = match.group(2).replace('"', '\\"')  # prefix every inner quote with a backslash
>>> newstr = "".join([match.group(1), midstr, match.group(3)])
>>> print(newstr)
"caption":"This caption \"should have \"the duobles quotes\" in the caption escaped\"","
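
The session above handles a single string that begins with the caption. For a longer string that holds many captions, the same idea can be applied to each match by passing a function to re.sub. This is only a minimal sketch: the helper name escape_caption_quotes is made up for illustration, and it assumes every caption value ends at the first "," that follows it.

```python
import re

# Assumption: a caption value runs from '"caption":"' to the first '","'.
CAPTION = re.compile(r'("caption":")(.*?)(",")')


def escape_caption_quotes(text):
    """Escape the double quotes inside every caption value found in text."""
    def fix(match):
        prefix, body, suffix = match.groups()
        # Only the caption body is changed; the delimiters are written back as-is.
        return prefix + body.replace('"', '\\"') + suffix
    return CAPTION.sub(fix, text)


one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'

# Both captions in one string, as in the longer real input.
print(escape_caption_quotes(one + two))
```

Because the replacement is a function, its return value is inserted literally, so there is no need to worry about backslash handling in a replacement template.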

Answer 1 (score: 0)

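import re

one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'

expression = r"""
(             # Capturing group 1
[\w ]         # The quote should be preceded by a word char or space.
)             # End group

(")           # Capturing group 2: match a quote character.

(             # Capturing group 3
[^,:]         # Quote should not be followed by a comma or colon.
)             # End group
"""
pattern = re.compile(expression, re.VERBOSE)

# Groups 1 and 3 are the characters around the quote, so both are written back.
result = pattern.sub(r'\1\\\2\3', two)
print(result)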

Demo, updated with a bug fix.

Answer 2 (score: 0)

I would try re.sub as follows:

import re

one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
result = re.sub(r"""(?<!^)(?<!:)(")(?!$)(?!:)""", r'\\\1', two)
print(result)

Output:

"caption":"This caption \"should have \"the duobles quotes\" in the caption escaped\"\","

LIVE DEMO

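Regex explanation

Match every quote that is not at the start or end of the line and is not immediately before or after a colon, then replace it with the same quote prefixed by a backslash (i.e. \").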
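
One thing worth checking before using this pattern on real data is what it does to a caption that needs no escaping. The sketch below reuses only the pattern and the two sample strings from this answer; the variable names are mine. In this check the quote that closes the caption is matched as well, because it is followed by a comma rather than a colon, so string one does not come back unchanged.

```python
import re

one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'

pattern = re.compile(r'(?<!^)(?<!:)(")(?!$)(?!:)')

fixed_one = pattern.sub(r'\\\1', one)
fixed_two = pattern.sub(r'\\\1', two)

# Compare the tail of each result with the original delimiter '","'.
print(fixed_one)
print(fixed_two)
print(fixed_one == one)
```

If the closing quote has to stay unescaped, the lookahead would need to exclude a following comma too, which in turn depends on whether captions can contain a quote directly before a comma.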

Answer 3 (score: 0)

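If you have the regex package installed (as mentioned in the comments), this should work:

import regex  # third-party package, not the standard library re module

# subject is the input string
result = regex.sub(r'(?<="caption":".*)"(?=.*",")', r'\"', subject)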

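As you can see, the regex is the same as yours except that I changed the capturing groups into lookarounds. Since those parts of the string are no longer consumed, there is no need to insert them back into the new string, so the replacement is just \"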
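
For context on why the regex package is needed here: the lookbehind contains .*, which has no fixed width, and the standard library re module only accepts fixed-width lookbehind. A small check, using nothing beyond the pattern from this answer:

```python
import re

pattern = r'(?<="caption":".*)"(?=.*",")'

try:
    re.compile(pattern)
    compiled = True
except re.error as exc:
    # The standard re module refuses a lookbehind that contains .*
    compiled = False
    print("re rejects the pattern:", exc)
```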

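I can't speak to the efficiency of this regex, because I know nothing about the surrounding text. If the target strings are each on their own line, it should be fine as long as you don't specify DOTALL mode. But the safest approach is to extract the strings first and process them individually.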
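
A rough sketch of that extract-and-process idea, using only the standard library. The sample string raw and its id field are made up, and the pattern assumes a caption value ends at the first "," after it. Each extracted caption is handed to json.dumps, which returns a complete JSON string literal with the inner quotes escaped, and json.loads then confirms the result parses.

```python
import json
import re

# Assumption: a caption value starts after '"caption":"' and ends at the first '","'.
CAPTION = re.compile(r'"caption":"(.*?)","')

# Made-up sample; the unescaped inner quotes make it invalid JSON as it stands.
raw = '{"caption":"He said "hi" to me","id":"7"}'


def encode_caption(match):
    # json.dumps adds the surrounding quotes and escapes the inner ones.
    return '"caption":' + json.dumps(match.group(1)) + ',"'


fixed = CAPTION.sub(encode_caption, raw)
data = json.loads(fixed)
print(fixed)
print(data["caption"])
```

Note that json.dumps also escapes backslashes, so a caption that already contains valid escape sequences would end up double-escaped; whether that matters depends on the real data.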

Answer 4 (score: 0)

import re

# The fourth parameter is count, the maximum number of replacements (not a position);
# the following removes only the first occurrence of "so".
sent = 'we are having so so much of fun'
print(re.sub("so", '', sent, count=1))