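I am trying to flatten an array of JSON objects (held in memory, not in a .json file) inside a Dask DataFrame. I have a lot of data and the process keeps consuming more and more RAM, so I need a solution that runs in parallel.
This is the JSON I have: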
[ {
"id": "0001",
"name": "Stiven",
"location": [{
"country": "Colombia",
"department": "Choco",
"city": "Quibdo"
}, {
"country": "Colombia",
"department": "Antioquia",
"city": "Medellin"
}, {
"country": "Colombia",
"department": "Cundinamarca",
"city": "Bogota"
}
]
}, {
"id": "0002",
"name": "Jhon Jaime",
"location": [{
"country": "Colombia",
"department": "Valle del Cauca",
"city": "Cali"
}, {
"country": "Colombia",
"department": "Putumayo",
"city": "Mocoa"
}, {
"country": "Colombia",
"department": "Arauca",
"city": "Arauca"
}
]
}, {
"id": "0003",
"name": "Francisco",
"location": [{
"country": "Colombia",
"department": "Atlantico",
"city": "Barranquilla"
}, {
"country": "Colombia",
"department": "Bolivar",
"city": "Cartagena"
}, {
"country": "Colombia",
"department": "La Guajira",
"city": "Riohacha"
}
]
}
]
This is the DataFrame I have:
index id name location
0 0001 Stiven [{'country':'Colombia', 'department': 'Choco', 'city': 'Quibdo'}, {'country':'Colombia', 'department': 'Antioquia', 'city': 'Medellin'}, {'country':'Colombia', 'department': 'Cundinamarca', 'city': 'Bogota'}]
1 0002 Jhon Jaime [{'country':'Colombia', 'department': 'Valle del Cauca', 'city': 'Cali'}, {'country':'Colombia', 'department': 'Putumayo', 'city': 'Mocoa'}, {'country':'Colombia', 'department': 'Arauca', 'city': 'Arauca'}]
2 0003 Francisco [{'country':'Colombia', 'department': 'Atlantico', 'city': 'Barranquilla'}, {'country':'Colombia', 'department': 'Bolivar', 'city': 'Cartagena'}, {'country':'Colombia', 'department': 'La Guajira', 'city': 'Riohacha'}]
I need to expand each id into one row per location, so that the DataFrame looks like this:
index id name country department city
0 0001 Stiven Colombia Choco Quibdo
1 0001 Stiven Colombia Antioquia Medellin
2 0001 Stiven Colombia Cundinamarca Bogota
3 0002 Jhon Jaime Colombia Valle del Cauca Cali
4 0002 Jhon Jaime Colombia Putumayo Mocoa
5 0002 Jhon Jaime Colombia Arauca Arauca
6 0003 Francisco Colombia Atlantico Barranquilla
7 0003 Francisco Colombia Bolivar Cartagena
8 0003 Francisco Colombia La Guajira Riohacha
All of the processing must run in parallel with Dask. Any suggestions?
Thanks.
Answer 0 (score: 0)
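I suggest first solving this problem for a plain Pandas DataFrame, and then using the map_partitions function to apply that solution to every Pandas partition inside the Dask DataFrame.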
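As a rough sketch of that idea (not necessarily the only or best way to do it), the Pandas step could be written with `DataFrame.explode`, and the Dask step is then a single `map_partitions` call. The names `flatten_locations`, `flatten_dask`, `LOCATION_KEYS` and `META` are made up for this example, and the sample data is just the first two records from the question.

```python
import pandas as pd

# Keys expected inside each location dict (taken from the question's JSON).
LOCATION_KEYS = ["country", "department", "city"]

# Empty frame describing the output columns; passed to Dask as `meta`
# so it does not have to guess the result schema.
META = pd.DataFrame(columns=["id", "name"] + LOCATION_KEYS, dtype="object")


def flatten_locations(pdf: pd.DataFrame) -> pd.DataFrame:
    """Pandas-only step: one output row per dict in the 'location' list."""
    # explode() emits one row per list element and repeats id/name next to it.
    exploded = pdf.explode("location", ignore_index=True)
    # Copy each key out of the dict; rows without a dict get None.
    for key in LOCATION_KEYS:
        exploded[key] = exploded["location"].map(
            lambda loc, k=key: loc.get(k) if isinstance(loc, dict) else None
        )
    return exploded.drop(columns="location")


def flatten_dask(ddf):
    """Dask step: run flatten_locations on every partition (lazy)."""
    return ddf.map_partitions(flatten_locations, meta=META)


# Small Pandas check using the first two records from the question.
sample = pd.DataFrame([
    {"id": "0001", "name": "Stiven", "location": [
        {"country": "Colombia", "department": "Choco", "city": "Quibdo"},
        {"country": "Colombia", "department": "Antioquia", "city": "Medellin"},
        {"country": "Colombia", "department": "Cundinamarca", "city": "Bogota"},
    ]},
    {"id": "0002", "name": "Jhon Jaime", "location": [
        {"country": "Colombia", "department": "Valle del Cauca", "city": "Cali"},
        {"country": "Colombia", "department": "Putumayo", "city": "Mocoa"},
        {"country": "Colombia", "department": "Arauca", "city": "Arauca"},
    ]},
])
flat = flatten_locations(sample)
print(flat)
```

With an existing Dask DataFrame `ddf`, something like `flatten_dask(ddf).compute().reset_index(drop=True)` should give the combined result; the `reset_index` is there because each partition restarts its index at 0. This assumes the `location` column still holds Python lists of dicts. Depending on the Dask version and configuration, object columns may be converted to strings (I believe the `dataframe.convert-string` setting controls this), in which case that conversion would have to be turned off or the column parsed back first.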