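I have to merge all records in a list of dictionaries that share the same cluster_id and build a JSON array from them. For example, ids 1 and 2 have the same cluster_id, so they should be merged as shown in the expected output: the three fields id, name, and match_full_address are collected into a JSON array under a new field, records. The singleton record with id 3 should be handled the same way.
My list of dictionaries: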
    [{
        'id': 1,
        'name': 'Will Smith',
        'match_full_address': 'Ridge Boulevard,123 Main Street,Branchburg,NJ',
        'cluster_id': 91,
        'lat': 18756.73,
        'longi': -97.395351,
    },
    {
        'id': 2,
        'name': 'Sandra Bullock',
        'match_full_address': 'New Castle,123 Mountain Ave,Branchburg,NJ',
        'cluster_id': 91,
        'lat': 18756.73,
        'longi': -97.395351,
    },
    {
        'id': 3,
        'name': 'Tom Cruise',
        'match_full_address': 'MI2, 123 Syracuse Avenue, Branchburg,NJ',
        'cluster_id': 92,
        'lat': 18756.73,
        'longi': -97.395351,
    }
    ]
Expected output:
    [{
        'cluster_id': 91,
        'lat': 18756.73,
        'longi': -97.395351,
        'records': [{'id': 1,
                     'name': 'Will Smith',
                     'match_full_address': 'Ridge Boulevard,123 Main Street,Branchburg,NJ'},
                    {'id': 2,
                     'name': 'Sandra Bullock',
                     'match_full_address': 'New Castle,123 Mountain Ave,Branchburg,NJ'}]
    },
    {
        'cluster_id': 92,
        'lat': 18756.73,
        'longi': -97.395351,
        'records': [{'id': 3,
                     'name': 'Tom Cruise',
                     'match_full_address': 'MI2, 123 Syracuse Avenue, Branchburg,NJ'}]
    }
    ]
Answer 0 (score: 2)
This kind of problem comes up a lot. The answer is always: sorted + groupby:
    from itertools import groupby

    def cluster_id_key(record):
        return record['cluster_id']

    def process(data):
        # groupby only merges adjacent items, so sort by the same key first
        sorted_data = sorted(data, key=cluster_id_key)
        for cluster_id, records in groupby(sorted_data, key=cluster_id_key):
            records = list(records)
            # keys whose value is identical across every record in the cluster
            common_props = [k for k, v in records[0].items() if all(v == r[k] for r in records)]
            cluster_data = {k: v for k, v in records[0].items() if k in common_props}
            reduced_records = [{k: v for k, v in record.items() if k not in common_props} for record in records]
            yield {**cluster_data, 'records': reduced_records}
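The solution above handles the case where an attribute such as lat is not the same for every element in a cluster: it then automatically puts lat inside the records array instead of at the cluster level. Conversely, if all records share the same value for an attribute, that attribute is placed outside records.
I'll leave it as an exercise to adapt this so that it produces the exact output you want. Note, for instance, that in a single-record cluster every key counts as common, so id, name, and match_full_address are hoisted out of records as well.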
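One possible way to do that adaptation, as a minimal sketch: name the cluster-level and record-level keys explicitly instead of detecting them. The names CLUSTER_KEYS, RECORD_KEYS, and merge_by_cluster are illustrative and not part of the original answer, and the sketch assumes lat and longi really are identical within a cluster.

```python
from itertools import groupby
from operator import itemgetter

# Assumption: these keys hold the same value for every record in a cluster.
CLUSTER_KEYS = ('cluster_id', 'lat', 'longi')
# Keys that stay on each individual record.
RECORD_KEYS = ('id', 'name', 'match_full_address')

def merge_by_cluster(data):
    """Group records by cluster_id using sorted + groupby."""
    by_cluster = itemgetter('cluster_id')
    merged = []
    # groupby only merges adjacent items, so the input is sorted first.
    for _, group in groupby(sorted(data, key=by_cluster), key=by_cluster):
        group = list(group)
        # Take the cluster-level values from the first record of the group.
        cluster = {k: group[0][k] for k in CLUSTER_KEYS}
        cluster['records'] = [{k: r[k] for k in RECORD_KEYS} for r in group]
        merged.append(cluster)
    return merged

sample = [
    {'id': 1, 'name': 'Will Smith',
     'match_full_address': 'Ridge Boulevard,123 Main Street,Branchburg,NJ',
     'cluster_id': 91, 'lat': 18756.73, 'longi': -97.395351},
    {'id': 2, 'name': 'Sandra Bullock',
     'match_full_address': 'New Castle,123 Mountain Ave,Branchburg,NJ',
     'cluster_id': 91, 'lat': 18756.73, 'longi': -97.395351},
    {'id': 3, 'name': 'Tom Cruise',
     'match_full_address': 'MI2, 123 Syracuse Avenue, Branchburg,NJ',
     'cluster_id': 92, 'lat': 18756.73, 'longi': -97.395351},
]

result = merge_by_cluster(sample)
```

With fixed key lists, a singleton cluster such as id 3 still gets a one-element records list, at the cost of no longer adapting automatically when lat or longi differ within a cluster.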
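Answer 1 (score: 2)
You can use a temporary dictionary to keep track of records with the same cluster_id, and keep appending the keys of interest to its records list.
Assuming your list of dictionaries is stored in the variable l: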
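    t = {}
    for d in l:
        if d['cluster_id'] not in t:
            # first time this cluster is seen: copy the shared fields;
            # d has no 'records' key, so d.get('records', []) starts it as []
            t[d['cluster_id']] = {k: d.get(k, []) for k in ('cluster_id', 'lat', 'longi', 'records')}
        t[d['cluster_id']]['records'].append({k: d[k] for k in ('id', 'name', 'match_full_address')})
    list(t.values())
returns: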
    [{'cluster_id': 91,
      'lat': 18756.73,
      'longi': -97.395351,
      'records': [{'id': 1,
                   'match_full_address': 'Ridge Boulevard,123 Main '
                                         'Street,Branchburg,NJ',
                   'name': 'Will Smith'},
                  {'id': 2,
                   'match_full_address': 'New Castle,123 Mountain '
                                         'Ave,Branchburg,NJ',
                   'name': 'Sandra Bullock'}]},
     {'cluster_id': 92,
      'lat': 18756.73,
      'longi': -97.395351,
      'records': [{'id': 3,
                   'match_full_address': 'MI2, 123 Syracuse Avenue, Branchburg,NJ',
                   'name': 'Tom Cruise'}]}]
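Because this approach looks clusters up by key, it does not need the input to be sorted. As a rough sketch of the same idea written with dict.setdefault (the function name group_records and the reordered rows are illustrative, not from the answer), records of one cluster can be interleaved with others and the clusters come out in first-seen order, since dictionaries preserve insertion order in Python 3.7+:

```python
def group_records(rows):
    """Group rows by cluster_id with a temporary dict; no sorting needed."""
    clusters = {}
    for row in rows:
        # setdefault returns the existing entry, or stores and returns the new one.
        cluster = clusters.setdefault(row['cluster_id'], {
            'cluster_id': row['cluster_id'],
            'lat': row['lat'],
            'longi': row['longi'],
            'records': [],
        })
        cluster['records'].append(
            {k: row[k] for k in ('id', 'name', 'match_full_address')}
        )
    return list(clusters.values())

# The question's records, deliberately reordered so cluster 91 is not contiguous.
rows = [
    {'id': 1, 'name': 'Will Smith',
     'match_full_address': 'Ridge Boulevard,123 Main Street,Branchburg,NJ',
     'cluster_id': 91, 'lat': 18756.73, 'longi': -97.395351},
    {'id': 3, 'name': 'Tom Cruise',
     'match_full_address': 'MI2, 123 Syracuse Avenue, Branchburg,NJ',
     'cluster_id': 92, 'lat': 18756.73, 'longi': -97.395351},
    {'id': 2, 'name': 'Sandra Bullock',
     'match_full_address': 'New Castle,123 Mountain Ave,Branchburg,NJ',
     'cluster_id': 91, 'lat': 18756.73, 'longi': -97.395351},
]

grouped = group_records(rows)
```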
Answer 2 (score: 1)
Although you could still use a comprehension, I don't think this is a good case for one. So just iterate over your list.
    #!/usr/bin/env python3
    import json
    listM = [{
        'id': 1,
        'name': 'Will Smith',
        'match_full_address': 'Ridge Boulevard,123 Main Street,Branchburg,NJ',
        'cluster_id': 91,
        'lat': 18756.73,
        'longi': -97.395351,
    },
    {
        'id': 2,
        'name': 'Sandra Bullock',
        'match_full_address': 'New Castle,123 Mountain Ave,Branchburg,NJ',
        'cluster_id': 91,
        'lat': 18756.73,
        'longi': -97.395351,
    },
    {
        'id': 3,
        'name': 'Tom Cruise',
        'match_full_address': 'MI2, 123 Syracuse Avenue, Branchburg,NJ',
        'cluster_id': 92,
        'lat': 18756.73,
        'longi': -97.395351,
    }
    ]
    clusters = dict()
    for item in listM:
        # fetch this cluster's entry, or start a new one if the cluster_id is unseen
        data = clusters.get(item['cluster_id'], {})
        if len(data) == 0:
            data["cluster_id"] = item["cluster_id"]
            data["lat"] = item["lat"]
            data["longi"] = item["longi"]  # key is 'longi', matching the expected output
            data["records"] = []
        data["records"].append({
            'id': item['id'],
            'name': item['name'],
            'match_full_address': item['match_full_address']
        })
        clusters.update({ item['cluster_id']: data })
    print(list(clusters.values()))
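Printing the list shows its Python repr (single-quoted strings), which is not valid JSON. If an actual JSON array is what is needed, as the question's wording suggests, the standard json module can serialize the merged list. A minimal sketch, with the merged result written out literally so the block runs on its own (the variable name payload is illustrative):

```python
import json

# The merged result from the question's expected output.
clusters = [
    {'cluster_id': 91, 'lat': 18756.73, 'longi': -97.395351,
     'records': [
         {'id': 1, 'name': 'Will Smith',
          'match_full_address': 'Ridge Boulevard,123 Main Street,Branchburg,NJ'},
         {'id': 2, 'name': 'Sandra Bullock',
          'match_full_address': 'New Castle,123 Mountain Ave,Branchburg,NJ'},
     ]},
    {'cluster_id': 92, 'lat': 18756.73, 'longi': -97.395351,
     'records': [
         {'id': 3, 'name': 'Tom Cruise',
          'match_full_address': 'MI2, 123 Syracuse Avenue, Branchburg,NJ'},
     ]},
]

# Serialize to a JSON array string; indent=2 is only for readability.
payload = json.dumps(clusters, indent=2)
print(payload)
```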