Merge all records with the same ID into a new JSON array

Time: 2018-10-12 11:01:10

Tags: python json python-3.x

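I have a list of dictionaries and need to merge all records that share the same cluster_id into a JSON array. For example, the records with id 1 and 2 have the same cluster_id, so they should be merged as shown in the expected output: the three fields id, name and match_full_address are collected into a JSON array under a new records field. The singleton record with id 3 should be handled the same way.

My list of dictionaries: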

[{
    'id': 1,
    'name': 'Will Smith',
    'match_full_address': 'Ridge Boulevard,123 Main Street,Branchburg,NJ',
    'cluster_id': 91,
    'lat': 18756.73,
    'longi': -97.395351,
},
{
    'id': 2,
    'name': 'Sandra Bullock',
    'match_full_address': 'New Castle,123 Mountain Ave,Branchburg,NJ',
    'cluster_id': 91,
    'lat': 18756.73,
    'longi': -97.395351,
},
{
    'id': 3,
    'name': 'Tom Cruise',
    'match_full_address': 'MI2, 123 Syracuse Avenue, Branchburg,NJ',
    'cluster_id': 92,
    'lat': 18756.73,
    'longi': -97.395351,
}
]

Expected output

[{
    'cluster_id': 91,
    'lat': 18756.73,
    'longi': -97.395351,
    'records': [{
        'id': 1,
        'name': 'Will Smith',
        'match_full_address': 'Ridge Boulevard,123 Main Street,Branchburg,NJ'
    },
    {
        'id': 2,
        'name': 'Sandra Bullock',
        'match_full_address': 'New Castle,123 Mountain Ave,Branchburg,NJ'
    }]
},
{
    'cluster_id': 92,
    'lat': 18756.73,
    'longi': -97.395351,
    'records': [{
        'id': 3,
        'name': 'Tom Cruise',
        'match_full_address': 'MI2, 123 Syracuse Avenue, Branchburg,NJ'
    }]
}
]

3 Answers:

Answer 0: (score: 2)

This kind of problem comes up a lot. The answer is always: sorted + groupby

from itertools import groupby

def cluster_id_key(record):
    return record['cluster_id']

def process(data):
    # groupby only groups adjacent items, so sort by the same key first
    sorted_data = sorted(data, key=cluster_id_key)
    for cluster_id, records in groupby(sorted_data, key=cluster_id_key):
        records = list(records)
        # keys whose value is identical across every record in the cluster
        common_props = [k for k, v in records[0].items() if all(v == r[k] for r in records)]
        cluster_data = {k: v for k, v in records[0].items() if k in common_props}
        reduced_records = [{k: v for k, v in record.items() if k not in common_props} for record in records]
        yield {**cluster_data, 'records': reduced_records}

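The solution above also handles the case where a property such as lat is not the same for all elements of a cluster. In that case it automatically puts lat inside the records array rather than at the cluster level. Likewise, if all records have the same value for a property, that property is placed outside records.

I'll leave adapting it to produce the exact desired output as an exercise.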
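To make the behaviour described above concrete, here is a minimal self-contained sketch of the same sorted + groupby idea. The function name group_by_cluster and the shortened sample data (with deliberately different lat values) are assumptions made for illustration; they are not part of the original answer.

```python
from itertools import groupby
from operator import itemgetter

def group_by_cluster(data):
    """Group records by cluster_id, hoisting keys shared by the whole cluster."""
    key = itemgetter('cluster_id')
    # groupby only merges adjacent items, hence the sort on the same key.
    for _, group in groupby(sorted(data, key=key), key=key):
        records = list(group)
        # Keys whose value is identical across every record of the cluster.
        common = [k for k, v in records[0].items()
                  if all(r[k] == v for r in records)]
        cluster = {k: records[0][k] for k in common}
        cluster['records'] = [{k: v for k, v in r.items() if k not in common}
                              for r in records]
        yield cluster

# Hypothetical sample: both records share cluster_id 91 but differ in 'lat'.
sample = [
    {'id': 1, 'name': 'Will Smith', 'cluster_id': 91, 'lat': 1.0, 'longi': -97.395351},
    {'id': 2, 'name': 'Sandra Bullock', 'cluster_id': 91, 'lat': 2.0, 'longi': -97.395351},
]
result = list(group_by_cluster(sample))
print(result)
```

Here lat ends up inside each entry of records, while cluster_id and longi stay at the cluster level. One thing to watch when adapting this: in a cluster with a single record, every key trivially counts as common, so all of that record's fields move to the cluster level and records becomes [{}]. Getting the exact output from the question therefore needs a fixed list of record-level keys rather than automatic detection.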

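Answer 1: (score: 2)

You can use a temporary dictionary to keep track of records with the same cluster_id, and keep appending the keys of interest to its records list.

Assuming your list of dictionaries is stored in the variable l: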

t = {}
for d in l:
    if d['cluster_id'] not in t:
        t[d['cluster_id']] = {k: d.get(k, []) for k in ('cluster_id', 'lat', 'longi', 'records')}
    t[d['cluster_id']]['records'].append({k: d[k] for k in ('id', 'name', 'match_full_address')})

list(t.values()) will return:

[{'cluster_id': 91,
  'lat': 18756.73,
  'longi': -97.395351,
  'records': [{'id': 1,
               'match_full_address': 'Ridge Boulevard,123 Main '
                                     'Street,Branchburg,NJ',
               'name': 'Will Smith'},
              {'id': 2,
               'match_full_address': 'New Castle,123 Mountain '
                                     'Ave,Branchburg,NJ',
               'name': 'Sandra Bullock'}]},
 {'cluster_id': 92,
  'lat': 18756.73,
  'longi': -97.395351,
  'records': [{'id': 3,
               'match_full_address': 'MI2, 123 Syracuse Avenue, Branchburg,NJ',
               'name': 'Tom Cruise'}]}]
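Since the question asks for a JSON array, the grouped list can then be serialized with json.dumps. The following is only a sketch of that step: the function name merge_clusters, the use of dict.setdefault in place of the explicit membership check, and the shortened sample data are assumptions, not part of the answer above.

```python
import json

# Shortened stand-in for the list `l` used above (addresses abbreviated).
l = [
    {'id': 1, 'name': 'Will Smith', 'match_full_address': 'addr 1',
     'cluster_id': 91, 'lat': 18756.73, 'longi': -97.395351},
    {'id': 2, 'name': 'Sandra Bullock', 'match_full_address': 'addr 2',
     'cluster_id': 91, 'lat': 18756.73, 'longi': -97.395351},
    {'id': 3, 'name': 'Tom Cruise', 'match_full_address': 'addr 3',
     'cluster_id': 92, 'lat': 18756.73, 'longi': -97.395351},
]

def merge_clusters(rows):
    """Collect rows into one dict per cluster_id, in first-seen order."""
    t = {}
    for d in rows:
        # setdefault returns the existing cluster entry, or stores the new one.
        cluster = t.setdefault(d['cluster_id'], {
            'cluster_id': d['cluster_id'],
            'lat': d['lat'],
            'longi': d['longi'],
            'records': [],
        })
        cluster['records'].append(
            {k: d[k] for k in ('id', 'name', 'match_full_address')})
    return list(t.values())

merged = merge_clusters(l)
# json.dumps turns the list of dicts into a JSON array string.
print(json.dumps(merged, indent=2))
```

Unlike the sorted + groupby approach, this needs no sorting, and on Python 3.7+ the clusters come out in the order their cluster_id first appears, because dictionaries preserve insertion order.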

Answer 2: (score: 1)

Although you could still use a comprehension, I don't think this is a good case for one. So just iterate over your list.

#!/usr/bin/env python3
import json


listM = [{
    'id': 1,
    'name': 'Will Smith',
    'match_full_address': 'Ridge Boulevard,123 Main Street,Branchburg,NJ',
    'cluster_id': 91,
    'lat': 18756.73,
    'longi': -97.395351,
},
{
    'id': 2,
    'name': 'Sandra Bullock',
    'match_full_address': 'New Castle,123 Mountain Ave,Branchburg,NJ',
    'cluster_id': 91,
    'lat': 18756.73,
    'longi': -97.395351,
},
{
    'id': 3,
    'name': 'Tom Cruise',
    'match_full_address': 'MI2, 123 Syracuse Avenue, Branchburg,NJ',
    'cluster_id': 92,
    'lat': 18756.73,
    'longi': -97.395351,
}
]

clusters = dict()
for item in listM:
    data = clusters.get(item['cluster_id'], {})
    if len(data) == 0:
        data["cluster_id"] = item["cluster_id"]
        data["lat"] = item["lat"]
        data["longi"] = item["longi"]
        data["records"] = []

    data["records"].append(
        dict({
            'id': item['id'],
            'name': item['name'],
            'match_full_address': item['match_full_address']
            })
        )
    clusters.update({ item['cluster_id']: data })

print(list(clusters.values()))