Question

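I have a sample string that looks like a list of dictionaries, but some of its values contain unescaped double quotes and commas, so json.loads cannot read it. I am writing code that finds each element and extracts its value up to the closing "," and stores the values as lists, so that the data can be converted to a data frame.

Example: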

filtered_data = '''[
   {
      "_id":"1231",
      "address":"akjd-dfdkfj",
      "body":"Your one time password is "sdkd". Enter this in the form to confirm your value.",
      "date":"Thu May 10 23:34:11 GMT+05:30 2018"
   },
   {
      "_id":"1245",
      "address":"sdsd-dgfg",
      "body":"Dear Customer, Reference number is 3435.To check latest status, sms DROP DFGDG on 38388338. Thank you, ABC",
      "date":"Thu May 10 13:22:54 GMT+05:30 2018"
   }
]'''

Code written so far:

import re
id_locs  = [(m.start(0), m.end(0)) for m in re.finditer('_id', filtered_data)]

How can I extract the values by specifying "," as the end string?

Expected output:

{
    "_id": [
        "1231",
        "1245"
    ],
    "_address": [
        "akjd-dfdkfj",
        "sdsd-dgfg"
    ],
    "body": [
        "Your one time password is 'sdkd'. Enter this in the form to confirm your value.",
        "Dear Customer, Reference number is 3435.To check latest status, sms DROP DFGDG on 38388338. Thank you, ABC"
    ],
    "date": [
        "Thu May 10 23:34:11 GMT+05:30 2018",
        "Thu May 10 13:22:54 GMT+05:30 2018"
    ]
}

Answer 1

Assuming the unescaped quotes occur only in the "body" lines, you can repair the string into proper JSON and then parse it. After that, you still need to reshape the list of dictionaries into a dictionary of lists.
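A minimal sketch of that repair-then-parse idea follows. It assumes every "body" entry sits on a single line that ends with '",' and that the value contains no quotes that are already escaped; the helper name escape_body_quotes is made up for illustration.

```python
import json
import re

filtered_data = '''[
   {
      "_id":"1231",
      "address":"akjd-dfdkfj",
      "body":"Your one time password is "sdkd". Enter this in the form to confirm your value.",
      "date":"Thu May 10 23:34:11 GMT+05:30 2018"
   },
   {
      "_id":"1245",
      "address":"sdsd-dgfg",
      "body":"Dear Customer, Reference number is 3435.To check latest status, sms DROP DFGDG on 38388338. Thank you, ABC",
      "date":"Thu May 10 13:22:54 GMT+05:30 2018"
   }
]'''

# Group 1: everything up to and including the opening quote of the value.
# Group 2: the raw value, possibly containing stray double quotes.
# Group 3: the closing quote and trailing comma at the end of the line.
body_line = re.compile(r'^([ \t]*"body"[ \t]*:[ \t]*")(.*)(",[ \t]*)$', re.MULTILINE)

def escape_body_quotes(match):
    """Backslash-escape the quotes inside the value (hypothetical helper)."""
    return match.group(1) + match.group(2).replace('"', r'\"') + match.group(3)

fixed = body_line.sub(escape_body_quotes, filtered_data)
records = json.loads(fixed)  # now a list of dictionaries

# Reshape the list of dictionaries into a dictionary of lists.
result = {}
for record in records:
    for key, value in record.items():
        result.setdefault(key, []).append(value)

print(result["body"][0])
```

If the stray quotes can also show up in other fields, the same substitution would have to be applied to those lines as well, which is where this approach stops being safe.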

Answer 2

If the string formed a valid JSON document, you could use the following (you might only need to add '[' and ']' at the start and end of the string):

import json
str2="""[{"_id":"1231","address":"akjd-dfdkfj","body": "Your one time password is sdkd. Enter this in the form to confirm your value.","date":"Thu May 10 23:34:11 GMT+05:30 2018"},{"_id":"1245","address":"sdsd-dgfg","body":"Dear Customer, Reference number is 3435.To check latest status, sms DROP DFGDG on 38388338. Thank you, ABC","date":"Thu May 10 13:22:54 GMT+05:30 2018"}]"""
# json.loads no longer accepts an encoding argument (removed in Python 3.9)
result_dicts = json.loads(str2)

Then "merge" the dictionaries together like this:

result_dict= dict()
for res_dict in result_dicts:
    for key, value in res_dict.items():
        result_dict.setdefault(key, list()).append(value)

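But if your sample string really looks the way it does in your description, it is not a valid JSON string, because the double quotes are used incorrectly (for example in "Your one time password is "sdkd". Enter this in the form to confirm your value."), and you need to parse it yourself.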
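As a quick illustration of why the sample is rejected, this sketch feeds a shortened, made-up record with the same kind of stray quotes to json.loads and catches the resulting error. The variable names are only for illustration.

```python
import json

# Shortened stand-in for one record of the sample; the quotes around sdkd are not escaped.
broken = '[{"body":"Your one time password is "sdkd". Enter this."}]'

parsed_ok = True
error_pos = None
try:
    json.loads(broken)
except json.JSONDecodeError as err:
    parsed_ok = False
    error_pos = err.pos  # character offset where the parser gave up
    print("Invalid JSON at offset {}: {}".format(err.pos, err.msg))
```

The parser treats the quote before sdkd as the end of the value and then fails on whatever follows, so no option of json.loads can recover the intended string.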

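To apply a regular expression, you need to make some assumptions that let you cut the string into valid pieces. For example, is it safe to assume that field values are always enclosed in double quotes? Or can you assume that a field value never contains the character combinations <",> and <"}> (I use <> to delimit the strings)?

If so, you can build the regular expression so that it cuts out the substrings delimited by one of these combinations, which gives you the field names and field values. Without such assumptions, the problem cannot be solved.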

Answer 3

A solution using a regular expression looks like this:

# raw string, so \s and \} are not treated as (invalid) string escapes
patt=re.compile(r'"([^"]*)"\s*:\s*"(.*?)"(,|\s*\})', re.MULTILINE)
result_dict=dict()
at_end= False
pos= 0
while not at_end:
    matcher= patt.search(filtered_data, pos= pos)
    at_end= matcher is None
    if not at_end:
        key, value, _= matcher.groups()
        result_dict.setdefault(key, list()).append(value)
        _, pos= matcher.span()

This assumes that the key/value pairs in your sample data always end with '",' or '"\s*}'.

With findall it looks more compact:

# raw string, so \s and \} are not treated as (invalid) string escapes
patt=re.compile(r'"([^"]*)"\s*:\s*"(.*?)"(,|\s*\})', re.MULTILINE)
result_dict=dict()
for key, value, sep in patt.findall(filtered_data):
    result_dict.setdefault(key, list()).append(value)

Answer 4

Here is one way to do it with regular expressions, using lookahead and lookbehind to get the values.

Example:

import re

filtered_data = '''[
   {
      "_id":"1231",
      "address":"akjd-dfdkfj",
      "body":"Your one time password is "sdkd". Enter this in the form to confirm your value.",
      "date":"Thu May 10 23:34:11 GMT+05:30 2018"
   },
   {
      "_id":"1245",
      "address":"sdsd-dgfg",
      "body":"Dear Customer, Reference number is 3435.To check latest status, sms DROP DFGDG on 38388338. Thank you, ABC",
      "date":"Thu May 10 13:22:54 GMT+05:30 2018"
   }
]'''
keys = set(re.findall(r'\"(.+)\":', filtered_data))   #Get Keys
result = {}
for key in keys:
    result[key] = re.findall(r'(?<=\"{}":\")(.*?)(?=\",?)'.format(key), filtered_data)   #Get Values.

print(result)

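Output (note that the first body value is cut off at its first inner quote, because the lookahead accepts any double quote):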

{'_id': ['1231', '1245'],
 'address': ['akjd-dfdkfj', 'sdsd-dgfg'],
 'body': ['Your one time password is ',
          'Dear Customer, Reference number is 3435.To check latest status, sms '
          'DROP DFGDG on 38388338. Thank you, ABC'],
 'date': ['Thu May 10 23:34:11 GMT+05:30 2018',
          'Thu May 10 13:22:54 GMT+05:30 2018']}
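The truncated body value can probably be avoided by making the lookahead stricter. The sketch below is a variation on the code above, not the original answer: it assumes a value only ends at a double quote that is followed by a comma or a closing brace, and that keys are written as "key":" with no spaces around the colon.

```python
import re

filtered_data = '''[
   {
      "_id":"1231",
      "address":"akjd-dfdkfj",
      "body":"Your one time password is "sdkd". Enter this in the form to confirm your value.",
      "date":"Thu May 10 23:34:11 GMT+05:30 2018"
   },
   {
      "_id":"1245",
      "address":"sdsd-dgfg",
      "body":"Dear Customer, Reference number is 3435.To check latest status, sms DROP DFGDG on 38388338. Thank you, ABC",
      "date":"Thu May 10 13:22:54 GMT+05:30 2018"
   }
]'''

# dict.fromkeys removes duplicate keys but keeps first-seen order, unlike set().
keys = list(dict.fromkeys(re.findall(r'"([^"\n]+)"\s*:', filtered_data)))

result = {}
for key in keys:
    # Lookbehind: the value starts right after "key":"
    # Lookahead: the value ends at a quote followed by a comma or a closing brace.
    pattern = '(?<="' + re.escape(key) + r'":")(.*?)(?="\s*[,}])'
    result[key] = re.findall(pattern, filtered_data)

print(result["body"][0])
```

A value that itself contains '",' would still be split in the wrong place, so this only shifts the assumption rather than removing it.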
