Question

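I have a string whose structure is similar to JSON (the kind of input json.loads(...) expects), as shown below:

"data1":{

    "data2":{
        "x":"Some text"
    },
    "source": "source: <a href=\"http://example.com/content/123/data123.pdf\">example.com</a>" }
    "data3":{
        "format": "f() { return this.data52 / 20 + "x"; }"
}

I want to get rid of the bare quotes inside quoted strings by escaping them, like this:

"data1":{
    "data2":{
        "x":"Some text"
    },
    "source": "source: <a href=\"http://example.com/content/123/data123.pdf\">example.com</a>" }
    "data3":{
        "format": "f() { return this.data52 / 20 + \"x\"; }"
    }

The strings given in the input are much longer, and I have many strings like this to format. As you can see, some of the quotes are already escaped as \". I tried this:

string = re.sub(r"\"(.*)\"",  r"\1", string).replace("\"", "\\\"")

But it replaces all the quotes. I also tried a negative lookaround, but that only works when there is a single quote nested inside another. Is there a way to do this with a regular expression? I could always iterate over the string and count quotes, but I don't think that is the best solution. Thanks for your help!

@EDIT

I constructed a case where the algorithm provided by @L3viathan does not work: the text in "hereDoesntWork" is simply ignored. The problem is that I don't know how these strings are nested.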

Answer 1

I didn't manage to do it with a single regular expression, but two should do:

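import re

# Key, separator, then the value; the last group is greedy,
# so it runs up to the last quote on the line.
pattern = r'"(.*?)"(.*?)"(.*)"'

# Raw string, so the \" sequences in the sample keep their backslashes.
s = r""""data1":{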

    "data2":{
        "x":"Some text"
    },
    "source": "source: <a href=\"http://example.com/content/123/data123.pdf\">example.com</a>" }
    "data3":{
        "format": "f() { return this.data52 / 20 + "x"; }"
}"""

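def fixer(match):
    # Called by re.sub once per match; the return value replaces the match.
    key = match.group(1)
    middle = match.group(2)
    content = match.group(3)
    # Debug output: show what each group captured.
    print(key, middle, content)
    return '"{}"{}"{}"'.format(
        key,
        middle,
        # Escape only the quotes that are not already preceded by a backslash.
        re.sub(r'(?<!\\)"', r'\"', content),
    )

print(re.sub(pattern, fixer, s))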

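First, I call re.sub with a function as the replacement argument. re.sub then calls that function with the match object for every match it finds and replaces the match with the function's return value.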
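To see the function-as-replacement behaviour in isolation, here is a minimal sketch. The names shout and text are made up for this illustration and are not part of the answer's code.

```python
import re

def shout(match):
    # re.sub passes in a match object; whatever is returned replaces the match.
    return match.group(0).upper()

# Hypothetical sample text, only for demonstration.
text = "find foo and foobar"
result = re.sub(r"foo\w*", shout, text)
print(result)
```

Each match is handed to shout separately, so the replacement can depend on what was actually matched.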

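The first regular expression (pattern) only matches lines that contain four or more quotes; the part between the third quote and the last quote on the line is matched greedily. The second regular expression (in fixer) matches quote characters that are not preceded by a backslash.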
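The two patterns can be tried on a single line to check that the result really is loadable. This is only a sketch under assumptions: the variable line is a hypothetical one-line sample in the style of the question's data, and wrapping the result in braces before calling json.loads is done here purely for the check.

```python
import json
import re

# Hypothetical one-line sample with unescaped inner quotes.
line = r'"format": "f() { return this.data52 / 20 + "x"; }"'

# First pattern: key, separator, then a greedy value up to the last quote.
match = re.fullmatch(r'"(.*?)"(.*?)"(.*)"', line)
key, middle, content = match.groups()

# Second pattern: escape only quotes not already preceded by a backslash.
escaped = re.sub(r'(?<!\\)"', r'\\"', content)

# Wrap in braces (an assumption for this check) so json.loads accepts it.
fixed = '{{"{}"{}"{}"}}'.format(key, middle, escaped)
parsed = json.loads(fixed)
print(parsed["format"])
```

Because of the negative lookbehind, quotes that already carry a backslash are left alone, so running the substitution on partly escaped input does not double-escape them.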

Find all quotes in a JSON-like string and replace the inner quotes using Python