Python - Group/merge dictionaries based on key/value identity

Time: 2019-12-03 16:53:28

Tags: python list dictionary merge key

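I have a list containing many dictionaries that all have the same keys but different values.

What I want to do is group/merge the dictionaries based on the values of certain keys. Showing an example is probably faster than trying to explain: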

[{'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 3, 'C2': 15},
 {'zone': 'B', 'weekday': 2, 'hour': 6,  'C1': 5, 'C2': 27},
 {'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 7, 'C2': 12},
 {'zone': 'C', 'weekday': 5, 'hour': 8,  'C1': 2, 'C2': 13}]

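So what I want to achieve is to merge the first and third dictionaries, since they have the same 'zone', 'hour' and 'weekday', summing the values in C1 and C2: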

[{'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 10, 'C2': 27},
 {'zone': 'B', 'weekday': 2, 'hour': 6,  'C1': 5, 'C2': 27},
 {'zone': 'C', 'weekday': 5, 'hour': 8,  'C1': 2, 'C2': 13}]

Any help here? :) I've been struggling with this for a few days. I have a bad, non-scalable solution, but I'm sure there is a more Pythonic way to do it.

Thanks!

4 Answers:

Answer 0: (score: 3)

By using defaultdict, you can merge them in linear time.

from collections import defaultdict

res = defaultdict(lambda : defaultdict(int))

# `dictionaries` is the list of dicts from the question
for d in dictionaries:
    key = (d['zone'], d['weekday'], d['hour'])
    res[key]['C1'] += d['C1']
    res[key]['C2'] += d['C2']

The downside is that you need to reshape the result to get the output format you asked for.
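To illustrate that reshaping step, here is a minimal sketch that flattens the nested defaultdict back into a list of dicts. The sample data is copied from the question; the helper name `merge_with_defaultdict` is made up for this illustration, and the result order relies on dicts preserving insertion order (Python 3.7+).

```python
from collections import defaultdict

# Sample input copied from the question.
dictionaries = [
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
    {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
    {'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13},
]


def merge_with_defaultdict(rows):  # hypothetical helper name
    res = defaultdict(lambda: defaultdict(int))
    for d in rows:
        key = (d['zone'], d['weekday'], d['hour'])
        res[key]['C1'] += d['C1']
        res[key]['C2'] += d['C2']
    # Flatten {(zone, weekday, hour): {'C1': ..., 'C2': ...}}
    # into one flat dict per group.
    return [
        {'zone': zone, 'weekday': weekday, 'hour': hour, **sums}
        for (zone, weekday, hour), sums in res.items()
    ]


merged = merge_with_defaultdict(dictionaries)
print(merged)
```

Each group key is unpacked back into the 'zone', 'weekday' and 'hour' fields, and the summed C1/C2 values are merged in alongside them.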

Answer 1: (score: 2)

I've gone ahead and written a slightly longer solution that uses namedtuples as the dictionary keys:

from collections import namedtuple

zones = [{'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 3, 'C2': 15},
 {'zone': 'B', 'weekday': 2, 'hour': 6,  'C1': 5, 'C2': 27},
 {'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 7, 'C2': 12},
 {'zone': 'C', 'weekday': 5, 'hour': 8,  'C1': 2, 'C2': 13}]

ZoneTime = namedtuple("ZoneTime", ["zone", "weekday", "hour"])
results = dict()

for zone in zones:
    zone_time = ZoneTime(zone['zone'], zone['weekday'], zone['hour'])
    if zone_time in results:
        results[zone_time]['C1'] += zone['C1']
        results[zone_time]['C2'] += zone['C2']
    else:
        results[zone_time] = {'C1': zone['C1'], 'C2': zone['C2']}


print(results)

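This uses a namedtuple of (zone, weekday, hour) as the key for each dictionary. From there it is straightforward to add to the entry if it already exists in results, or to create a new entry in the dictionary otherwise.

You could definitely make this shorter and "smarter", but it might become harder to read.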
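As one possible shorter variant (a sketch, not the answerer's code), the if/else can be dropped by combining `defaultdict` with `collections.Counter`, whose `update` method adds to existing counts rather than replacing them. This assumes C1 and C2 are plain numbers; the name `merged` is just for illustration.

```python
from collections import Counter, defaultdict, namedtuple

ZoneTime = namedtuple("ZoneTime", ["zone", "weekday", "hour"])

# Sample input copied from the question.
zones = [
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
    {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
    {'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13},
]

results = defaultdict(Counter)
for zone in zones:
    key = ZoneTime(zone['zone'], zone['weekday'], zone['hour'])
    # Counter.update adds these values to any totals already stored.
    results[key].update({'C1': zone['C1'], 'C2': zone['C2']})

# _asdict() turns the namedtuple key back into its named fields.
merged = [{**key._asdict(), **sums} for key, sums in results.items()]
print(merged)
```

Whether this is easier to read than the explicit if/else is a matter of taste, which is the trade-off the answer points out.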

Answer 2: (score: 2)

Sort, then group by the relevant keys; iterate over the groups and build new dictionaries with the sums.

import operator
import itertools

keys = operator.itemgetter('zone','weekday','hour')
c1_c2 = operator.itemgetter('C1','C2')

# data is your list of dicts
data.sort(key=keys)
grouped = itertools.groupby(data,keys)

new_data = []
for (zone,weekday,hour),g in grouped:
    c1,c2 = 0,0
    for d in g:
        c1 += d['C1']
        c2 += d['C2']
    new_data.append({'zone':zone,'weekday':weekday,
                     'hour':hour,'C1':c1,'C2':c2})

The last loop could also be written as:

for (zone,weekday,hour),g in grouped:
    cees = map(c1_c2,g)
    c1,c2 = map(sum,zip(*cees))
    new_data.append({'zone':zone,'weekday':weekday,
                     'hour':hour,'C1':c1,'C2':c2})
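The sort matters because `itertools.groupby` only groups consecutive items that share a key. The small sketch below, using the question's data, shows the difference, and it uses `sorted()` rather than `list.sort()` so that the caller's list is not reordered. The variable names `unsorted_groups` and `sorted_groups` are invented for this illustration.

```python
from itertools import groupby
from operator import itemgetter

# Sample input copied from the question.
data = [
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
    {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
    {'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13},
]
keys = itemgetter('zone', 'weekday', 'hour')

# Without sorting, the two 'A' rows are not adjacent,
# so groupby puts them in separate groups.
unsorted_groups = [k for k, _ in groupby(data, keys)]

# sorted() returns a new list and leaves `data` in its original order.
sorted_groups = [k for k, _ in groupby(sorted(data, key=keys), keys)]

print(len(unsorted_groups))  # 4
print(len(sorted_groups))    # 3
```

If mutating the input is acceptable, `data.sort(key=keys)` as in the answer avoids the extra copy.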

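Answer 3: (score: 1)

Edit: runtime comparison

My original answer (see below) was not a very good one, but I think I've made a useful contribution by running some timing analysis on the other answers, so I've edited that part in and put it at the top. Here I include the three other solutions, along with the transformation needed to produce the desired output. For completeness, I've also included a version using pandas, which assumes the user is already working with a DataFrame (converting from a list of dicts to a DataFrame and back isn't even worth it). The timings vary slightly depending on the random data generated, but these are representative: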

>>> run_timer(100)
Times with 100 values
    ...with defaultdict: 0.1496697600000516
    ...with namedtuple: 0.14976404899994122
    ...with groupby: 0.0690777249999428
    ...with pandas: 3.3165711250001095
>>> run_timer(1000)
Times with 1000 values
    ...with defaultdict: 1.267153091999944
    ...with namedtuple: 0.9605341750000207
    ...with groupby: 0.6634409229998255
    ...with pandas: 3.5146895360001054
>>> run_timer(10000)
Times with 10000 values
    ...with defaultdict: 9.194478484000001
    ...with namedtuple: 9.157486462000179
    ...with groupby: 5.18553969300001
    ...with pandas: 4.704001281000046
>>> run_timer(100000)
Times with 100000 values
    ...with defaultdict: 59.644778522000024
    ...with namedtuple: 89.26688319799996
    ...with groupby: 93.3517027989999
    ...with pandas: 14.495209061999958

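Takeaways:

  • Using a pandas DataFrame saves a lot of time on large datasets
    • Note: I am not including the conversion between a list of dicts and a DataFrame, which is definitely significant
  • Otherwise, the accepted solution by wwii (the groupby version in the timings above) wins for small and medium datasets, but it may be the slowest for very large ones
  • Changing the group size (e.g. by reducing the number of zones) has a huge effect, which is not examined here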

Here is the script I used to generate the timings above.

import random
import pandas

from timeit import timeit

from functools import partial

from itertools import groupby
from operator import itemgetter

from collections import namedtuple, defaultdict


def with_pandas(df):
    return df.groupby(['zone', 'weekday', 'hour']).agg(sum).reset_index()


def with_groupby(data):
    keys = itemgetter('zone', 'weekday', 'hour')

    # data is your list of dicts
    data.sort(key=keys)
    grouped = groupby(data, keys)

    new_data = []
    for (zone, weekday, hour), g in grouped:
        c1, c2 = 0, 0
        for d in g:
            c1 += d['C1']
            c2 += d['C2']
        new_data.append({'zone': zone, 'weekday': weekday,
                         'hour': hour, 'C1': c1, 'C2': c2})

    return new_data


def with_namedtuple(zones):
    ZoneTime = namedtuple("ZoneTime", ["zone", "weekday", "hour"])
    results = dict()
    for zone in zones:
        zone_time = ZoneTime(zone['zone'], zone['weekday'], zone['hour'])
        if zone_time in results:
            results[zone_time]['C1'] += zone['C1']
            results[zone_time]['C2'] += zone['C2']
        else:
            results[zone_time] = {'C1': zone['C1'], 'C2': zone['C2']}
    return [
        {
            'zone': key[0],
            'weekday': key[1],
            'hour': key[2],
            **val
        }
        for key, val in results.items()
    ]


def with_defaultdict(dictionaries):
    res = defaultdict(lambda: defaultdict(int))
    for d in dictionaries:
        res[(d['zone'], d['weekday'], d['hour'])]['C1'] += d['C1']
        res[(d['zone'], d['weekday'], d['hour'])]['C2'] += d['C2']
    return [
        {
            'zone': key[0],
            'weekday': key[1],
            'hour': key[2],
            **val
        }
        for key, val in res.items()
    ]


def gen_random_vals(num):
    return [
        {
            'zone': random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
            'weekday': random.randint(1, 7),
            'hour': random.randint(0, 23),
            'C1': random.randint(1, 50),
            'C2': random.randint(1, 50),
        }
        for idx in range(num)
    ]


def run_timer(num_vals=1000, timeit_num=1000):
    vals = gen_random_vals(num_vals)
    df = pandas.DataFrame(vals)
    p_fmt = "\t...with %s: %s"
    times = {
        'defaultdict': timeit(stmt=partial(with_defaultdict, vals), number=timeit_num),
        'namedtuple': timeit(stmt=partial(with_namedtuple, vals), number=timeit_num),
        'groupby': timeit(stmt=partial(with_groupby, vals), number=timeit_num),
        'pandas': timeit(stmt=partial(with_pandas, df), number=timeit_num),
    }
    print("Times with %d values" % num_vals)
    for key, val in times.items():
        print(p_fmt % (key, val))


Original answer:

Just for fun, here is a completely different approach using groupby. Granted, it's not the prettiest, but it should be fast.

from itertools import groupby
from operator import itemgetter
from pprint import pprint

vals = [
    {'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 3, 'C2': 15},
    {'zone': 'B', 'weekday': 2, 'hour': 6,  'C1': 5, 'C2': 27},
    {'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 7, 'C2': 12},
    {'zone': 'C', 'weekday': 5, 'hour': 8,  'C1': 2, 'C2': 13}
]
ordered = sorted(
    [
        (
            (row['zone'], row['weekday'], row['hour']),
            row['C1'], row['C2']
        )
        for row in vals
    ]
)


def invert_columns(grp):
    return zip(*[g_row[1:] for g_row in grp])


merged = [
    {
        'zone': key[0],
        'weekday': key[1],
        'hour': key[2],
        **dict(
            zip(["C1", "C2"], [sum(col) for col in invert_columns(grp)])
        )
    }
    for key, grp in groupby(ordered, itemgetter(0))
]

pprint(merged)

which produces

[{'C1': 10, 'C2': 27, 'hour': 12, 'weekday': 1, 'zone': 'A'},
 {'C1': 5, 'C2': 27, 'hour': 6, 'weekday': 2, 'zone': 'B'},
 {'C1': 2, 'C2': 13, 'hour': 8, 'weekday': 5, 'zone': 'C'}]