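Regex to reformat incorrectly formatted JSON data

Date: 2016-12-16 13:38:03

Tags: python json

I have some data that was not saved properly in an old database. I am moving the system to a new database and reformatting the old data. The old data looks like this: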

a:10:{
    s:7:"step_no";s:1:"1";
    s:9:"YOUR_NAME";s:14:"Firtname Lastname";
    s:11:"CITIZENSHIP"; s:7:"Indian";
    s:22:"PROPOSE_NAME_BUSINESS1"; s:12:"ABC Limited";
    s:22:"PROPOSE_NAME_BUSINESS2"; s:15:"XYZ Investment";
    s:22:"PROPOSE_NAME_BUSINESS3";s:0:"";
    s:22:"PROPOSE_NAME_BUSINESS4";s:0:"";
    s:23:"PURPOSE_NATURE_BUSINESS";s:15:"Some dummy content";
    s:15:"CAPITAL_COMPANY";s:24:"20 Million Capital";
    s:14:"ANOTHER_AMOUNT";s:0:"";
}

I want the new data to be in proper JSON format, so that I can read it in Python just like this:

data = {
    "step_no": "1",
    "YOUR_NAME":"Firtname Lastname",
    "CITIZENSHIP":"Indian",
    "PROPOSE_NAME_BUSINESS1":"ABC Limited",
    "PROPOSE_NAME_BUSINESS2":"XYZ Investment",
    "PROPOSE_NAME_BUSINESS3":"",
    "PROPOSE_NAME_BUSINESS4":"",
    "PURPOSE_NATURE_BUSINESS":"Some dummy content",
    "CAPITAL_COMPANY":"20 Million Capital",
    "ANOTHER_AMOUNT":""
}

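I was thinking that using a regex to strip out the unwanted parts and reformat the content using the names in capital letters would work, but I don't know how to go about it.

1 Answer:

Answer 0: (score: 2)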

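Regular expressions are the wrong approach here. They are not needed, and the format is a little more complex than you assume.

Your data is in the PHP serialize format. You can trivially deserialize it in Python using the phpserialize library:
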
import phpserialize
import json

def fixup_php_arrays(o):
    if isinstance(o, dict):
        if isinstance(next(iter(o), None), int):
            # PHP has no lists, only mappings; produce a list for
            # a dictionary with integer keys to 'repair'
            return [fixup_php_arrays(o[i]) for i in range(len(o))]
        return {k: fixup_php_arrays(v) for k, v in o.items()}
    return o

json.dumps(fixup_php_arrays(phpserialize.loads(yourdata, decode_strings=True)))
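
To illustrate why the format is more complex than it looks: each string is stored with its byte length up front, so the content may itself contain quotes and semicolons. The following is a minimal stdlib-only sketch, not part of phpserialize; read_php_string and the tricky sample are hypothetical names made up for this illustration.

```python
import re

def read_php_string(data, pos=0):
    """Read one s:<length>:"<bytes>"; token at pos; return (value, next_pos)."""
    if data[pos:pos + 2] != b's:':
        raise ValueError("not a string token")
    colon = data.index(b':', pos + 2)       # colon that ends the length field
    length = int(data[pos + 2:colon])       # declared length in bytes
    start = colon + 2                       # skip the colon and the opening quote
    end = start + length
    if data[colon + 1:start] != b'"' or data[end:end + 2] != b'";':
        raise ValueError("declared length does not match the data")
    return data[start:end], end + 2

# A single 9-byte string whose content looks like the end of one token
# and the start of another.
tricky = b's:9:"a";s:1:"b";'

value, next_pos = read_php_string(tricky)
print(value)

# A naive non-greedy regex stops at the first '";' and truncates the value.
naive = re.match(rb's:\d+:"(.*?)";', tricky).group(1)
print(naive)
```

The length-based reader returns the whole value, while the regex stops after the first character. The same length check is also why a token with a wrong length, such as s:7:"Indian"; from the question, is rejected rather than silently misread.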

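Note that PHP strings are byte strings, not Unicode text, so, especially in Python 3, you would have to decode your key-value pairs after the fact if you want to be able to re-encode them to JSON. The decode_strings=True flag takes care of this for you. The default codec is UTF-8; pass in a charset argument to pick a different one.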
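
As a rough illustration of the bytes problem, here is a stdlib-only sketch of what decoding "after the fact" would involve if decode_strings=True were not used. The decode_bytes helper and the raw sample are assumptions made up for this example, not part of phpserialize.

```python
import json

def decode_bytes(o, charset="utf-8"):
    """Recursively decode bytes keys and values to str (hypothetical helper)."""
    if isinstance(o, bytes):
        return o.decode(charset)
    if isinstance(o, dict):
        return {decode_bytes(k, charset): decode_bytes(v, charset)
                for k, v in o.items()}
    return o

# Hand-written stand-in for an undecoded result: bytes keys and values.
raw = {b"step_no": b"1", b"CITIZENSHIP": b"Indian"}

try:
    json.dumps(raw)
except TypeError as exc:
    # In Python 3 the json module refuses bytes keys and values.
    print("json.dumps failed:", exc)

print(json.dumps(decode_bytes(raw), sort_keys=True))
```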

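PHP also uses arrays for sequences, so you may have to convert decoded dict objects with integer keys to lists first, which is what the fixup_php_arrays() function does.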
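
A small usage sketch of that conversion, with the function repeated so the snippet runs on its own. The decoded value is a hand-written stand-in for a deserialized PHP array holding a nested list; its "names" key is made up for illustration.

```python
def fixup_php_arrays(o):
    if isinstance(o, dict):
        if isinstance(next(iter(o), None), int):
            # Integer keys: treat the mapping as a PHP list.
            return [fixup_php_arrays(o[i]) for i in range(len(o))]
        return {k: fixup_php_arrays(v) for k, v in o.items()}
    return o

# Assumed shape of a decoded array with a nested integer-keyed array.
decoded = {"names": {0: "ABC Limited", 1: "XYZ Investment"}, "step_no": "1"}

fixed = fixup_php_arrays(decoded)
print(fixed)
```

Note that the function only inspects the first key and expects the integer keys to run from 0 to len(o) - 1; a mapping with gaps in its keys would raise a KeyError.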

Demo (with the data repaired; many of the string lengths were off and whitespace had been added):

>>> import phpserialize, json
>>> from pprint import pprint
>>> data = b'a:10:{s:7:"step_no";s:1:"1";s:9:"YOUR_NAME";s:18:"Firstname Lastname";s:11:"CITIZENSHIP";s:6:"Indian";s:22:"PROPOSE_NAME_BUSINESS1";s:11:"ABC Limited";s:22:"PROPOSE_NAME_BUSINESS2";s:14:"XYZ Investment";s:22:"PROPOSE_NAME_BUSINESS3";s:0:"";s:22:"PROPOSE_NAME_BUSINESS4";s:0:"";s:23:"PURPOSE_NATURE_BUSINESS";s:18:"Some dummy content";s:15:"CAPITAL_COMPANY";s:18:"20 Million Capital";s:14:"ANOTHER_AMOUNT";s:0:"";}'
>>> pprint(phpserialize.loads(data, decode_strings=True))
{'ANOTHER_AMOUNT': '',
 'CAPITAL_COMPANY': '20 Million Capital',
 'CITIZENSHIP': 'Indian',
 'PROPOSE_NAME_BUSINESS1': 'ABC Limited',
 'PROPOSE_NAME_BUSINESS2': 'XYZ Investment',
 'PROPOSE_NAME_BUSINESS3': '',
 'PROPOSE_NAME_BUSINESS4': '',
 'PURPOSE_NATURE_BUSINESS': 'Some dummy content',
 'YOUR_NAME': 'Firstname Lastname',
 'step_no': '1'}
>>> print(json.dumps(phpserialize.loads(data, decode_strings=True), sort_keys=True, indent=4))
{
    "ANOTHER_AMOUNT": "",
    "CAPITAL_COMPANY": "20 Million Capital",
    "CITIZENSHIP": "Indian",
    "PROPOSE_NAME_BUSINESS1": "ABC Limited",
    "PROPOSE_NAME_BUSINESS2": "XYZ Investment",
    "PROPOSE_NAME_BUSINESS3": "",
    "PROPOSE_NAME_BUSINESS4": "",
    "PURPOSE_NATURE_BUSINESS": "Some dummy content",
    "YOUR_NAME": "Firstname Lastname",
    "step_no": "1"
}
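
The repair applied to the demo data (correcting the string lengths and dropping the added whitespace) could be scripted roughly as follows. This is only a sketch under strong assumptions: the data contains nothing but string tokens inside a single array, no string contains a double quote, and the array count is already right. repair_lengths is a hypothetical helper, not part of phpserialize.

```python
import re

# One string token plus any whitespace around it: s:<length>:"<content>";
# Assumption: the content never contains a double quote.
_STRING_TOKEN = re.compile(r'\s*s:\d+:"([^"]*)";\s*')

def repair_lengths(broken, charset="utf-8"):
    """Recompute every string length and strip whitespace between tokens."""
    def fix(match):
        content = match.group(1)
        # PHP counts bytes, not characters, so measure the encoded form.
        return 's:%d:"%s";' % (len(content.encode(charset)), content)
    return _STRING_TOKEN.sub(fix, broken.strip()).encode(charset)

broken = '''a:2:{
    s:9:"YOUR_NAME";s:14:"Firtname Lastname";
    s:11:"CITIZENSHIP"; s:7:"Indian";
}'''

repaired = repair_lengths(broken)
print(repaired)
```

The resulting bytes can then be handed to phpserialize.loads() as in the demo above. Because it relies on a regular expression, this sketch has exactly the weakness described earlier and is only suitable for simple data like the sample in the question.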