I am trying to escape double quotes in a string so that json.loads can load it. The code below is my attempt to work out how to do that correctly.
import re
one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
print(re.sub('("caption":".*?)"(.*?",")', r'\1\"\2', one))
print(re.sub('("caption":".*?)"(.*?",")', r'\1\"\2', two))
This is the current output.
"caption":"This caption should not match nor have any double quotes escaped","
"caption":"This caption \"should have "the duobles quotes" in the caption escaped"","
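The problem is that only the first double quote in the second string gets escaped. I realize the error is in my regex, which is not my strong suit. I have read plenty of posts here and spent a lot of time on Google, to no avail.
Note that the actual string I am working with is about 10,000 characters long and contains many occurrences of both kinds of caption string.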
Answer 0 (score: 0)
>>> import re
>>> one = '"caption":"This caption should not match nor have any double quotes escaped","'
>>> two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
>>> match = re.match(r"(\"caption\"\:\")(.*)(\",\")", two)
>>> midstr = match.group(2).replace('"', u'\u005C"')
>>> newstr = "".join([match.group(1), midstr, match.group(3)])
>>> print(newstr)
"caption":"This caption \"should have \"the duobles quotes\" in the caption escaped\"","
Answer 1 (score: 0)
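import re
one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
expression = r"""
( # Capturing group 1
[\w ] # The quote should be preceded by a word char or space.
) # End group
(") # Capturing group 2: match a quote character.
( # Capturing group 3
[^,:] # The quote should not be followed by a comma or colon.
) # End group
"""
pattern = re.compile(expression, re.VERBOSE)
# Re-insert all three groups, with a backslash added before the quote.
for text in (one, two):
    print(pattern.sub(r'\1\\\2\3', text))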
Demo updated with a bug fix.
Answer 2 (score: 0)
I would try re.sub, as shown below:
import re
one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
result = re.sub(r"""(?<!^)(?<!:)(")(?!$)(?!:)""", r'\\\1', two)
print(result)
Output:
"caption":"This caption \"should have \"the duobles quotes\" in the caption escaped\"\","
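Regex explanation
Match every quote that is not at the start or end of the line and is not directly before or after a :, then replace it with the same quote preceded by a backslash (i.e. \").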
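As the output above shows, this pattern also escapes the quote that closes the caption value (the \", near the end). One possible variation, offered as a sketch and not as a general fix, additionally skips quotes that sit next to a comma. It assumes that a quote adjacent to , or : is always a JSON delimiter, which would be wrong for a caption containing text such as: He said "hi", then left.

```python
import re

two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'

# A quote is escaped only if it is not at the start or end of the string
# and has no ':' or ',' directly before or after it.
inner_quote = re.compile(r'(?<!^)(?<![:,])"(?!$)(?![:,])')

# In the replacement template, `\\` stands for one literal backslash.
result = inner_quote.sub(r'\\"', two)
print(result)
```

With this variation the closing quote of the caption is left alone, so only the four quotes inside the caption text are escaped.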
Answer 3 (score: 0)
If you have the regex package installed (as mentioned in the comments), this should work:
import regex  # third-party package, not the standard re module
result = regex.sub(r'(?<="caption":".*)"(?=.*",")', r'\"', subject)
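As you can see, the regex is the same as yours except that I changed the capturing groups to lookarounds. Because those parts of the string are no longer consumed by the match, there is no need to insert them back into the new string, so the replacement is just \".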
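I can't speak to the efficiency of this regex, because I know nothing about the surrounding text. If the target strings are each on their own line, it should be fine as long as you don't specify DOTALL mode. But the safest approach is to extract the strings first and process them separately.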
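The remark about DOTALL can be illustrated with the standard re module alone; the regex package is not needed for this point. In the sketch below the two-line sample text is made up for illustration. Without DOTALL a greedy .* stops at the end of each line, while with DOTALL it runs across the newline and swallows both captions in one match.

```python
import re

# Made-up sample: two captions, each on its own line.
text = '"caption":"first","\n"caption":"second","'
pattern = r'"caption":"(.*)","'

# Default mode: `.` does not match a newline, so each line matches separately.
per_line = re.findall(pattern, text)

# DOTALL: `.` also matches the newline, so one match spans both lines.
across_lines = re.findall(pattern, text, re.DOTALL)

print(per_line)
print(across_lines)
```

The first call finds two separate captions; the second returns a single capture that runs from the first caption into the second.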
Answer 4 (score: 0)
import re
# The fourth parameter is count, the maximum number of replacements (not a position);
# count=1 removes only the first occurrence of "so".
sent = 'we are having so so much of fun'
re.sub("so", '', sent, count=1)
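To make the role of that fourth argument concrete, here is a small sketch using the same sentence, with 'very' chosen as an arbitrary replacement. The argument is count: the default of 0 replaces every occurrence, and a positive value caps the number of replacements, working from left to right. It cannot by itself pick out, say, only the second match.

```python
import re

sent = 'we are having so so much of fun'

# Default count=0: every occurrence is replaced.
all_replaced = re.sub('so', 'very', sent)

# count=1: only the leftmost occurrence is replaced.
first_only = re.sub('so', 'very', sent, count=1)

print(all_replaced)
print(first_only)
```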