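I know this title seems fairly popular here, but a quick look through those questions shows they usually involve an asker who has an isolated piece of JSON.
In my case a double quote (") is sometimes used to denote inches, or a pair of them wraps a phrase to mark some kind of nickname, and all of this occurs inside a value string of a JS object that is itself already wrapped in double quotes.
Here is an example of the JS object string I'm having trouble with (I'm using regex to quote the keys and remove the extra whitespace, but this is the string exactly as extracted, in all its glory):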
'{\n\t\t\n\t\t\t\t\t\n\t\n\n\t\n\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"16241885":{title: "Nosefrida Fridababy Windi Gas & Colic Relief", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"8650356":{title: "Babyganics Face- Hand & Baby Wipes- Fragrance Free- 100 Count", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"16249889":{title: "Nosefrida Nasal Aspirator Replacement Filters", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"8650355":{title: "Babyganics Face- Hand & Baby Wipes- Fragrance Free- 40 Count", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"15490928":{title: "BabyGanics Newborn Ultra Absorbent Jumbo Size Diapers - 36 Count", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"14712536":{title: "Marvel Superhero Bandages", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"16263505":{title: "Nosefrida "The Snotsucker" Nasal Aspirator", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"14848093":{title: "Zarbee\'s Children\'s Cough Syrup - Grape", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t \n\n\t\t\n\t}'
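I first tried json.dumps on the string, but that just double-escaped everything and required a double json.loads, which put me back at square one. I then tried a regular expression like this: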
double_quotes_in_json = re.compile(r'(?<=:)(\s*"[^"]*)(")([^"]*)(")?(?=[^"]*",|"\s*\})')

def escape_double_quotes(jsn_string, pattern=double_quotes_in_json):
    for match in pattern.finditer(jsn_string):
        # current pattern only matches 1 instance of either one double quote in JSON value string
        # (presumably signifying inches) or 1 instance of phrase wrapped in double quotes
        # for something like nicknames
        # matches will have either 3 or 4 groups, representing one of the 2 match types described above
        groups_matched = len(match.groups())
        entire_match = match.group()
        if groups_matched == 3:
            # we only matched one double quote
            subbed_match = pattern.sub('$1\\$2$3', entire_match)
            jsn_string = re.sub(entire_match, subbed_match, jsn_string)
        elif groups_matched == 4:
            # we matched a phrase wrapped in double quotes
            subbed_match = pattern.sub('$1\\$2$3\\$4', entire_match)
            jsn_string = re.sub(entire_match, subbed_match, jsn_string)
    return jsn_string
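While this seemed the most promising approach, it appears to re-insert the double quotes without the escape character I put in the sub call, and it also fails to put the first group back. (I tried both with and without a raw string for the replacement in sub.) So for the problem section above (substring shown below):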
"16263505":{title: "Nosefrida "The Snotsucker" Nasal Aspirator"
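the pattern does not substitute group 1 back in and, for some reason, substitutes in a single quote (below is the substring after the failed regex processing):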
"16263505":{title: "The Snotsucker"' Nasal Aspirator"
Either way, json.loads complains about the unescaped double quote.
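To make the failure concrete, here is a minimal sketch of what json.loads does with the unescaped inner quotes, and why json.dumps does not help. The variable names and the shortened one-key sample are made up for illustration; only the standard json module is used.

```python
import json

# Shortened sample: the same title as above, once with raw inner quotes
# and once with the inner quotes escaped the way JSON requires.
broken = '{"title": "Nosefrida "The Snotsucker" Nasal Aspirator"}'
fixed = '{"title": "Nosefrida \\"The Snotsucker\\" Nasal Aspirator"}'

try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    # The parser treats the quote before "The" as the end of the value.
    print('broken input rejected:', exc.msg)

print(json.loads(fixed)['title'])

# json.dumps on a str only adds another layer of JSON string quoting,
# so loading it once just hands the same malformed text back.
round_tripped = json.loads(json.dumps(broken))
print(round_tripped == broken)
```

So the escaping has to happen on the inner quotes only, before the text ever reaches json.loads.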
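Edit 1: My regex can pick out the unescaped quotes, but putting them back in doesn't work as expected. I'm probably doing something silly here and could use a fresh pair of eyes.
Sample output of the function with print statements: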
low_inventory = response.xpath(
    '//script[contains(., "islistEligibility") or contains(., "ishlistEligibility")]/text()'
).re_first(r'(?s)(?<=registryWislistEligibilityMap)(?:\s*=\s*)(\{.+\})')

In [453]: for m in double_quotes_in_json.finditer(low_inventory):
     ...:     groups_matched = len(m.groups())
     ...:     print('groups: ', m.groups())
     ...:     entire_match = m.group()
     ...:     print('entire match: ', m.group())
     ...:     if groups_matched == 3:
     ...:         # we only matched a single double quote
     ...:         subbed_match = double_quotes_in_json.sub(r'$1\\$2$3', entire_match)
     ...:         print('subbed3: ', subbed_match)
     ...:         jsn_string = re.sub(entire_match, subbed_match, jsn_string)
     ...:     elif groups_matched == 4:
     ...:         subbed_match = double_quotes_in_json.sub(r'$1\\$2$3\\\$4', entire_match)
     ...:         print('subbed4: ', subbed_match)
     ...:         jsn_string = re.sub(entire_match, subbed_match, jsn_string)
     ...:     print(jsn_string)
     ...:
groups: (' "Nosefrida ', '"', 'The Snotsucker', '"')
entire match: "Nosefrida "The Snotsucker"
subbed4: "Nosefrida "The Snotsucker"
{ "16241885":{"title": "Nosefrida Fridababy Windi Gas & Colic Relief", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "8650356":{"title": "Babyganics Face- Hand & Baby Wipes- Fragrance Free- 100 Count", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "16249889":{"title": "Nosefrida Nasal Aspirator Replacement Filters", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "8650355":{"title": "Babyganics Face- Hand & Baby Wipes- Fragrance Free- 40 Count", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "15490928":{"title": "BabyGanics Newborn Ultra Absorbent Jumbo Size Diapers - 36 Count", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "14712536":{"title": "Marvel Superhero Bandages", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "16263505":{"title": "The Snotsucker"' Nasal Aspirator", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "14848093":{"title": "Zarbee's Children's Cough Syrup - Grape", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true} }
Answer 0 (score: 0)
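For some reason, using Python's built-in str.replace achieved the desired result, whereas re.sub did not escape the double quotes properly. (That was the case whether I used group references in a raw string with a single escape or in a regular string with a double escape.) Either way, here is the working function. If anyone has insight into why replace works where re.sub does not, I'd be very interested.
(Old code commented out.)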
import re

double_quotes_in_json = re.compile(r'(?<=:)(\s*")([^"]*)(")([^"]*)(")?(?=[^"]*",|"\s*\})')

def escape_double_quotes(jsn_string, pattern=double_quotes_in_json):
    for match in pattern.finditer(jsn_string):
        # current pattern only matches 1 instance of either one double quote in JSON value string
        # (presumably signifying inches) or 1 instance of phrase wrapped in double quotes
        # for something like nicknames
        # match.groups() always has 5 entries (one per capturing group), so the two
        # match types are told apart by whether the optional 5th group is None
        groups = match.groups()
        entire_match = match.group()
        print('groups: ', groups)
        print('entire: ', entire_match)
        if groups[4] is None:
            # we only matched one double quote
            # subbed_match = pattern.sub('$1$2\\$3$4', entire_match)
            # jsn_string = re.sub(entire_match, subbed_match, jsn_string)
            target = ''.join(groups[1:4])
            replaced = target.replace('"', '\\"')
            print(replaced)
            jsn_string = jsn_string.replace(target, replaced)
        else:
            # we matched a phrase wrapped in double quotes
            # subbed_match = pattern.sub('$1$2\\$3$4\\$5', entire_match)
            # jsn_string = re.sub(entire_match, subbed_match, jsn_string)
            target = ''.join(groups[1:])
            replaced = target.replace('"', '\\"')
            print(replaced)
            jsn_string = jsn_string.replace(target, replaced)
    return jsn_string
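A possible explanation for the re.sub behaviour, offered as a sketch rather than a definitive diagnosis: Python's re module does not treat $1 as a group reference in a replacement string (it expects \1 or \g<1>), match.groups() always returns one entry per capturing group in the pattern (an unmatched optional group shows up as None), and the first argument of re.sub is itself compiled as a pattern, so matched text passed there would need re.escape. The helper escape_inner_quotes and the two sample strings below are hypothetical and exist only for this illustration.

```python
import json
import re

# Pattern copied from the function above: five capturing groups, the last optional.
double_quotes_in_json = re.compile(
    r'(?<=:)(\s*")([^"]*)(")([^"]*)(")?(?=[^"]*",|"\s*\})'
)

# 1) "$1" is plain text to Python's re; group references are \1 or \g<1>.
dollar_style = re.sub(r'(a)(b)', '$1-$2', 'ab')
python_style = re.sub(r'(a)(b)', r'\g<2>-\g<1>', 'ab')
print(dollar_style, python_style)

# 2) groups() has one slot per capturing group, matched or not.
nickname = '{"title": "Nosefrida "The Snotsucker" Nasal Aspirator", "isAllowed": true}'
inches = '{"title": "Monitor 24" wide", "isAllowed": true}'
print(double_quotes_in_json.search(nickname).groups())
print(double_quotes_in_json.search(inches).groups())


# 3) A replacement callback avoids group-reference syntax altogether.
def escape_inner_quotes(text, pattern=double_quotes_in_json):
    def _escape(match):
        opening = match.group(1)  # the value's opening delimiter stays as-is
        inner = ''.join(g for g in match.groups()[1:] if g is not None)
        return opening + inner.replace('"', '\\"')
    return pattern.sub(_escape, text)


print(json.loads(escape_inner_quotes(nickname))['title'])
print(json.loads(escape_inner_quotes(inches))['title'])
```

The callback form does the substitution in a single pass over the text, which also sidesteps the second re.sub call on the matched text. It inherits whatever limitations the pattern itself has, so it is only as reliable as the pattern is for the data at hand.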
Edit #1 (AKA: after sleeping on it):
double_quotes_in_title_attr = re.compile(
    r'(?<="title":)(?:\s*")(?P<value>.+?)(?=",\s*"\w+":|"\s*\})'
)

def escape_double_quotes_in_title(jsn_string, pattern=double_quotes_in_title_attr):
    for match in pattern.finditer(jsn_string):
        target = match.group('value')
        replaced = target.replace('"', '\\"')
        jsn_string = jsn_string.replace(target, replaced)
    return jsn_string

# use this first to properly quote keys so the above pattern will match
unquoted_key_pattern = re.compile(r'(?!")(\'?(?P<key>\w+)\'?)(?=:\s*(?:"|false|true|\d|\[|\{))')

def fix_json_keys(jsn, pattern=unquoted_key_pattern):
    return pattern.sub(r'"\g<key>"', jsn)
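For completeness, here is how the two helpers might be chained before calling json.loads. The definitions are repeated so the snippet runs on its own, and the raw string is a shortened, hypothetical two-entry version of the object from the question, so treat it as a sketch rather than a tested solution for the full page.

```python
import json
import re

unquoted_key_pattern = re.compile(
    r'(?!")(\'?(?P<key>\w+)\'?)(?=:\s*(?:"|false|true|\d|\[|\{))'
)
double_quotes_in_title_attr = re.compile(
    r'(?<="title":)(?:\s*")(?P<value>.+?)(?=",\s*"\w+":|"\s*\})'
)


def fix_json_keys(jsn, pattern=unquoted_key_pattern):
    # Wrap bare keys such as title or isAllowed in double quotes.
    return pattern.sub(r'"\g<key>"', jsn)


def escape_double_quotes_in_title(jsn_string, pattern=double_quotes_in_title_attr):
    # Escape every double quote found inside a "title" value.
    for match in pattern.finditer(jsn_string):
        target = match.group('value')
        replaced = target.replace('"', '\\"')
        jsn_string = jsn_string.replace(target, replaced)
    return jsn_string


# Shortened sample (assumption: same shape as the scraped object).
raw = (
    '{\n\t"16263505":{title: "Nosefrida "The Snotsucker" Nasal Aspirator", '
    'isIneligible: false, isAllowed: true}\n\t,\n'
    '"14848093":{title: "Zarbee\'s Children\'s Cough Syrup - Grape", '
    'isIneligible: false, isAllowed: true}\n}'
)

# Keys must be quoted first, otherwise the title pattern's lookbehind never matches.
cleaned = escape_double_quotes_in_title(fix_json_keys(raw))
data = json.loads(cleaned)
print(data['16263505']['title'])
```

The order matters because the title pattern looks for the already-quoted "title": key. Titles that contain a colon followed by a quote, digit or bracket could confuse the key pattern, so real data may need extra checks.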
Thanks to @deceze for the help.