Flatten a list of JSON strings with nested dictionaries

Time: 2020-09-05 20:56:27

Tags: python json pandas dictionary

I would like to convert the following tuples in a list

[('1599324732926-0',
     {'data': '{"timestamp":1599324732.767,
                "receipt_timestamp":1599324732.9256856,
                "delta":true,
                "bid":{"338.9":0.06482,"338.67":3.95535},
                "ask":{"339.12":2.47578,"339.13":6.43172}
               }'
     }
 ),
 ('1599324732926-1',
     {'data': '{"timestamp":1599324832.767,
                "receipt_timestamp":1599324832.9256856,
                "delta":true,
                "bid":{"338.8":0.06482,"338.57":3.95535},
                "ask":{"340.12":2.47578,"340.13":6.43172}
               }'
     }
 )
]

into a list of dicts or a DataFrame (either one is fine; going from one to the other is not complicated):

[{
  'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'ask',
  'price': 338.9,
  'size': 0.06482},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'ask',
  'price': 338.67,
  'size': 3.95535},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'ask',
  'price': 338.66,
  'size': 16.78636},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'ask',
  'price': 338.63,
  'size': 2.5},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'ask',
  'price': 338.45,
  'size': 6.06071},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'ask',
  'price': 338.38,
  'size': 0.0},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'ask',
  'price': 338.95,
  'size': 0.0},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'ask',
  'price': 338.96,
  'size': 0.0},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'ask',
  'price': 339.11,
  'size': 0.0},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'bid',
  'price': 339.12,
  'size': 2.47578},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'bid',
  'price': 339.13,
  'size': 6.43172},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'bid',
  'price': 339.36,
  'size': 0.0},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'bid',
  'price': 339.52,
  'size': 6.5},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'bid',
  'price': 341.18,
  'size': 0.0},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'bid',
  'price': 341.19,
  'size': 0.0},
  ...
]

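Basically,

  • The leading ID is dropped (it is actually kept in a separate list).
  • The value under data is a JSON object with nested dictionaries.
  • The trick is that "bid" and "ask" become the values of a key named "side" in the resulting dictionaries.
  • The keys of the nested "bid" and "ask" dictionaries become the values of a key named "price" in the resulting dictionaries.
  • The values stored under each price stay values, under a key named "size".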

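I am able to process each JSON element of the list separately. But the list can contain up to 600,000 elements. Is it possible to use some pandas or numpy function to process the list as a whole, so that it runs faster?

I looked at pandas json_normalize(), but in the examples given, the keys of the dictionaries systematically become columns, whereas in this case the price keys have to become values of the "price" column.

Do you know how I could do that? Is there a way to first pre-process the JSON list so that it can then be processed further with json_normalize()?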
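One possible pre-processing step, shown only as a sketch: reshape each message so that its price levels form a list of small records, then hand that list to json_normalize() through its record_path and meta parameters. The helper name to_nested and the intermediate key 'levels' are made up for this illustration. The reshaping is still a Python loop, so nothing here shows that it is faster than the per-element approach.

```python
import json
import pandas as pd

raw = [
    ('1599324732926-0',
     {'data': '{"timestamp":1599324732.767,'
              '"receipt_timestamp":1599324732.9256856,'
              '"delta":true,'
              '"bid":{"338.9":0.06482,"338.67":3.95535},'
              '"ask":{"339.12":2.47578,"339.13":6.43172}}'}),
]


def to_nested(item):
    """Hypothetical helper: move bid/ask levels into a list under 'levels'."""
    msg = json.loads(item[1]['data'])
    levels = [
        {'side': side, 'price': float(price), 'size': size}
        for side in ('bid', 'ask')
        for price, size in msg.get(side, {}).items()
    ]
    return {
        'timestamp': msg['timestamp'],
        'receipt_timestamp': msg['receipt_timestamp'],
        'delta': msg['delta'],
        'levels': levels,
    }


nested = [to_nested(item) for item in raw]

# One row per entry of 'levels'; the meta keys are repeated on every row.
df = pd.json_normalize(
    nested,
    record_path='levels',
    meta=['timestamp', 'receipt_timestamp', 'delta'],
)
```

With this shape, the record columns (side, price, size) come first and the meta columns follow; df.to_dict('records') then gives the list of dicts.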

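For reference, here is the code I was able to write to process each element of the list separately, but I do not think this is the right direction. The next step would be to wrap it in a for loop, which would be much slower than a solution that handles the whole list at once.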

import json
import pandas as pd

data_light = ('1599324732926-0',
     {'data': '{"timestamp":1599324732.767, \
                "receipt_timestamp":1599324732.9256856,\
                "delta":true, \
                "bid":{"338.9":0.06482,"338.67":3.95535}, \
                "ask":{"339.12":2.47578,"339.13":6.43172} \
               }'
     }
 )

var=json.loads(data_light[1]['data'])
var_bid=var['bid']
var_ask=var['ask']
mylist=list(var_bid.items())+list(var_ask.items())

# side labels must follow the same order as mylist: bids first, then asks
it = ['bid'] * len(var_bid) + ['ask'] * len(var_ask)

timestamp=var['timestamp']
receipt_timestamp=var['receipt_timestamp']
delta=var['delta']
midx = pd.MultiIndex.from_product([[timestamp], [receipt_timestamp], [delta],it], names=['timestamp', 'receipt_timestamp', 'delta', 'side'])

df=pd.DataFrame(mylist, index=midx, columns=['price', 'size'], dtype=float)
my_dict=df.reset_index().to_dict('records')

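2 answers:

Answer 0 (score: 1)

This is not an exact answer to your question, since it is not a pandas or numpy implementation, but I think it should meet your needs.

Try looking at multiprocessing.pool.Pool.map

Suppose you have a function that receives a tuple from the original list and returns the desired data dictionary. Let's say its signature is as follows: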

def tuple_to_dict(input):
    # conversion code goes here
    return result_dict

Then you can use multiprocessing.Pool() like this:

import multiprocessing


if __name__ == '__main__':

    input_list = [...] # your input list

    with multiprocessing.Pool() as pool:
        result_list = pool.map(tuple_to_dict, input_list)
        print(result_list)

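Notes:

  1. The Pool() object should be created inside the if __name__ == "__main__" block, or in a function called (directly or indirectly) from there; otherwise you may get a RuntimeError on platforms that start worker processes by spawning, such as Windows.

  2. The with ... as ... statement is used here so that the Pool object is cleaned up when you are done with it or when something fails. If you do not use the with/as syntax, create the pool before a try block and add a pool.close() statement in the finally block to make sure the pool is closed.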
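As a rough sketch of what the conversion function could look like for the data in the question: each input tuple produces several records (one per price level), so pool.map returns a list of lists that has to be flattened afterwards. The names tuple_to_records and convert_all, and the default chunksize of 1000, are assumptions made for this example and are not tuned values.

```python
import json
from itertools import chain
from multiprocessing import Pool


def tuple_to_records(item):
    """Turn one (id, {'data': json_string}) tuple into a list of flat dicts."""
    msg = json.loads(item[1]['data'])
    common = {key: msg[key] for key in ('timestamp', 'receipt_timestamp', 'delta')}
    return [
        {**common, 'side': side, 'price': float(price), 'size': size}
        for side in ('bid', 'ask')
        for price, size in msg.get(side, {}).items()
    ]


def convert_all(items, processes=None, chunksize=1000):
    """Run tuple_to_records in worker processes and flatten the result."""
    with Pool(processes) as pool:
        # Larger chunks mean fewer round trips between the processes.
        nested = pool.map(tuple_to_records, items, chunksize=chunksize)
    return list(chain.from_iterable(nested))
```

Call convert_all(input_list) from inside the if __name__ == '__main__' block. Whether this beats a plain loop depends on the data, because every tuple and every result has to be pickled and sent between processes, so it is worth timing both on a sample.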

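Answer 1 (score: 1)

  • It is easier to extract the information iteratively than to use pandas.json_normalize.
  • As shown in the sample data, the value of data is of type str and must be converted to a dict
  • The main task is to extract each key-value pair from 'bid' and 'ask' to create separate records.
    • A list comprehension does the job of creating the separate records.

import json
import pandas as pd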

# list of tuples, where the value of data, is a string
transaction_data = [('1599324732926-0', {'data': '{"timestamp":1599324732.767, "receipt_timestamp":1599324732.9256856, "delta":true, "bid":{"338.9":0.06482,"338.67":3.95535}, "ask":{"339.12":2.47578,"339.13":6.43172}}'}),
                    ('1599324732926-1', {'data': '{"timestamp":1599324732.767, "receipt_timestamp":1599324732.9256856, "delta":true, "bid":{"338.9":0.06482,"338.67":3.95535}, "ask":{"339.12":2.47578,"339.13":6.43172}}'}),
                    ('1599324732926-2', {'data': '{"timestamp":1599324732.767, "receipt_timestamp":1599324732.9256856, "delta":true, "bid":{"338.9":0.06482,"338.67":3.95535}, "ask":{"339.12":2.47578,"339.13":6.43172}}'})]

# create a list of lists for each transaction data
# split each side, key value pair into a separate list
data_key_list = [['timestamp', 'receipt_timestamp', 'delta', 'side', 'price', 'size']]

for v in transaction_data:  # iterate through each transaction
    data = json.loads(v[1]['data'])  # convert the string to a dict
    for side in ['bid', 'ask']:  # extract each key, value pair as a separate record
        data_key_list += [[data['timestamp'], data['receipt_timestamp'], data['delta'], side, float(price), size] for price, size in data[side].items()]

# create a dataframe
df = pd.DataFrame(data_key_list[1:], columns=data_key_list[0])

# display(df.head())
     timestamp  receipt_timestamp  delta side   price     size
0  1.59932e+09        1.59932e+09   True  bid   338.9  0.06482
1  1.59932e+09        1.59932e+09   True  bid  338.67  3.95535
2  1.59932e+09        1.59932e+09   True  ask  339.12  2.47578
3  1.59932e+09        1.59932e+09   True  ask  339.13  6.43172
4  1.59932e+09        1.59932e+09   True  bid   338.9  0.06482

Convert to a list of dictionaries

df.to_dict(orient='records')

[out]:
[{'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'bid',
  'price': 338.9,
  'size': 0.06482},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'bid',
  'price': 338.67,
  'size': 3.95535},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'ask',
  'price': 339.12,
  'size': 2.47578},
 {'timestamp': 1599324732.767,
  'receipt_timestamp': 1599324732.9256856,
  'delta': True,
  'side': 'ask',
  'price': 339.13,
  'size': 6.43172},
 ...]
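If only the list of dictionaries is needed, the DataFrame round trip can probably be skipped. The following is a minimal sketch, assuming the same input layout as above; the generator name iter_records is made up for this example. It yields one record per price level, so a large input can also be consumed lazily instead of being materialized all at once.

```python
import json


def iter_records(transactions):
    """Yield one flat dict per bid/ask price level of every transaction."""
    for _id, payload in transactions:
        msg = json.loads(payload['data'])  # the value of 'data' is a JSON string
        for side in ('bid', 'ask'):
            for price, size in msg.get(side, {}).items():
                yield {
                    'timestamp': msg['timestamp'],
                    'receipt_timestamp': msg['receipt_timestamp'],
                    'delta': msg['delta'],
                    'side': side,
                    'price': float(price),  # JSON object keys are strings
                    'size': size,
                }


transaction_data = [
    ('1599324732926-0',
     {'data': '{"timestamp":1599324732.767, "receipt_timestamp":1599324732.9256856, '
              '"delta":true, "bid":{"338.9":0.06482,"338.67":3.95535}, '
              '"ask":{"339.12":2.47578,"339.13":6.43172}}'}),
]

records = list(iter_records(transaction_data))
```

The same generator can feed pd.DataFrame(iter_records(transaction_data)) if a DataFrame is wanted later.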