Group and sum similar values in Python

Time: 2018-07-09 09:56:22

Tags: python dictionary counter defaultdict

I have data in the following format:

d = [
 {'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}},
 {'key': '2018-05-11', 'vals': {'Clicks': 365, 'Link Clicks': 379}},

 {'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11}},
 {'key': '2018-05-11', 'vals': {'Clicks': 1348, 'Link Clicks': 73}},

]

That is, it has multiple entries with the same key.

I want to group it so that Clicks and Link Clicks are summed for each shared date.

So the output should look like this:

d = [
 {'key': '2018-05-10', 'vals': {'Clicks': 368, 'Link Clicks': 221}},
 {'key': '2018-05-11', 'vals': {'Clicks': 1713, 'Link Clicks': 452}},
]

I thought of first grouping the values together with a defaultdict:

from collections import defaultdict

dd = defaultdict(list)

for i in d:
    dd[i['key']].append(i['vals'])

This gives the following output:

{'2018-05-10': [
             {'Clicks': 229, 'Link Clicks': 210},
             {'Clicks': 139, 'Link Clicks': 11}
              ],
 '2018-05-11': [
             {'Clicks': 365, 'Link Clicks': 379},
             {'Clicks': 1348, 'Link Clicks': 73}
             ]}

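Now, I think I could use Counter to sum the values, but I don't know how to do it. Also, the key names, i.e. Clicks and Link Clicks, may change, and vals can contain more than 2 entries.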

Can this also be done without defaultdict? Is there a better way?

Note: I don't think this defaultdict approach is ideal, because I always want the data sorted by date, and as soon as I use a dict I lose the order.

8 answers:

Answer 0: (score: 3)

from pprint import pprint
from collections import Counter, OrderedDict

d = {
'2018-05-10': [
             {'Clicks': 229, 'Link Clicks': 210},
             {'Clicks': 139, 'Link Clicks': 11}
              ],
 '2018-05-11': [
             {'Clicks': 365, 'Link Clicks': 379},
             {'Clicks': 1348, 'Link Clicks': 73}
             ],
}

m = OrderedDict()
for k, v in sorted(d.items()):  # ISO dates sort chronologically as strings
    m[k] = Counter()
    for i in v:
        m[k].update(i)
    m[k] = dict(m[k])
    # or if you want to keep the 'vals' key and list:
    # m[k] = [{"vals": dict(m[k])}]

pprint(m)

Output:

OrderedDict([('2018-05-10', {'Clicks': 368, 'Link Clicks': 221}),
             ('2018-05-11', {'Clicks': 1713, 'Link Clicks': 452})])
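
The approach above relies on Counter.update adding to existing counts rather than overwriting them, which is what distinguishes it from dict.update. A minimal sketch of that difference, using two of the sample vals dicts:

```python
from collections import Counter

# Counter.update adds the incoming values to the counts already stored.
total = Counter()
total.update({'Clicks': 229, 'Link Clicks': 210})
total.update({'Clicks': 139, 'Link Clicks': 11})

# dict.update overwrites, so only the last value per key survives.
plain = {}
plain.update({'Clicks': 229, 'Link Clicks': 210})
plain.update({'Clicks': 139, 'Link Clicks': 11})

print(dict(total))  # {'Clicks': 368, 'Link Clicks': 221}
print(plain)        # {'Clicks': 139, 'Link Clicks': 11}
```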

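Answer 1: (score: 2)

You can use a nested dictionary comprehension. The relevant c_type keys, i.e. Clicks and Link Clicks, are taken from the first dict in each date's list, so every entry for a date is assumed to share those keys. Apart from that, the approach naturally accepts any number of categories.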

res = {k: {'vals': {c_type: sum(item[c_type] for item in v) for c_type in v[0]}}
       for k, v in dd.items()}

{'2018-05-10': {'vals': {'Clicks': 368, 'Link Clicks': 221}},
 '2018-05-11': {'vals': {'Clicks': 1713, 'Link Clicks': 452}}}
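
If the entries for one date might not all carry the same categories, one possible variation is to take the union of the keys and treat a missing category as 0. This is only a sketch: the 'Impressions' value below is made up for illustration and is not part of the original data.

```python
from collections import defaultdict

d = [
    {'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}},
    {'key': '2018-05-11', 'vals': {'Clicks': 365, 'Link Clicks': 379}},
    # 'Impressions' is a hypothetical extra category, added for illustration
    {'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11, 'Impressions': 50}},
    {'key': '2018-05-11', 'vals': {'Clicks': 1348, 'Link Clicks': 73}},
]

# Group the vals dicts by date, as in the question.
dd = defaultdict(list)
for entry in d:
    dd[entry['key']].append(entry['vals'])

res = {}
for k, v in dd.items():
    keys = set().union(*v)  # every category seen for this date
    # item.get(c, 0) treats a category missing from an entry as zero
    res[k] = {'vals': {c: sum(item.get(c, 0) for item in v) for c in sorted(keys)}}

print(res)
```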

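Answer 2: (score: 2)

Instead of making the output a list of dictionaries in which each dictionary has only two keys (key and vals), I would suggest simply using actual {key: vals} dictionary pairs!

This makes the code cleaner and more readable, and it makes accessing a specific date neater: you don't need to loop through the list (O(n)), you can access the date directly and get the clicks.

For example,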

dates = {}
for dd in d:
    dates.setdefault(dd['key'], []).append(dd['vals'])

dates = {k: {kk:sum(dd[kk] for dd in v) for kk in v[0].keys()} \
                                        for k,v in dates.items()}

This gives:

{
  "2018-05-10": {
    "Clicks": 368,
    "Link Clicks": 221
  },
  "2018-05-11": {
    "Clicks": 1713,
    "Link Clicks": 452
  }
}

Now you can get the data for a specific date directly with something like:

dates['2018-05-11']['Clicks']
#1713

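If you need a list of dictionaries sorted by date, we can take the current dictionary and order it by where each date first appears in the original data, since that data appears to be sorted already: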

order = [dd['key'] for dd in d]
date_list = sorted([{'key':k,'vals':v} for k,v in dates.items()], \
                                       key=lambda dd: order.index(dd['key']))

This gives date_list as a list sorted by date:

[
  {
    "key": "2018-05-10",
    "vals": {
      "Clicks": 368,
      "Link Clicks": 221
    }
  },
  {
    "key": "2018-05-11",
    "vals": {
      "Clicks": 1713,
      "Link Clicks": 452
    }
  }
]

Answer 3: (score: 1)

We can generalize this into a basic "group fold" approach:

from operator import add, itemgetter

def group_fold(data, fold=add, key=itemgetter('key'), vals=itemgetter('vals')):
    result = {}
    for entry in data:
        ky = key(entry)
        vlb = vals(entry)
        vla = result.get(ky)
        if vla is not None:
            for subk, subv in vlb.items():
                if subk in vla:
                    vla[subk] = fold(vla[subk], subv)
                else:
                    vla[subk] = subv
        else:
            result[ky] = dict(vlb)
    return result

So we can now use it as group_fold(d), but we can also customize the fold function, for example using mul instead of add:

from operator import mul

group_fold(d, fold=mul)
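
To make the effect of swapping the fold function concrete, here is a compact, self-contained sketch of the same idea. The name group_fold_sketch is hypothetical and the function is a simplified variant, not the code above; it compares add with max on the sample data.

```python
from operator import add

def group_fold_sketch(data, fold=add):
    """Hypothetical compact variant: combine 'vals' per 'key' using fold."""
    result = {}
    for entry in data:
        acc = result.setdefault(entry['key'], {})
        for name, value in entry['vals'].items():
            # The first value for a category is stored as-is; later ones are folded in.
            acc[name] = fold(acc[name], value) if name in acc else value
    return result

d = [
    {'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}},
    {'key': '2018-05-11', 'vals': {'Clicks': 365, 'Link Clicks': 379}},
    {'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11}},
    {'key': '2018-05-11', 'vals': {'Clicks': 1348, 'Link Clicks': 73}},
]

summed = group_fold_sketch(d)             # sums per date
peaks = group_fold_sketch(d, fold=max)    # largest single value per date
print(summed)
print(peaks)
```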

Answer 4: (score: 1)

from collections import defaultdict, Counter, OrderedDict
ld = [{'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}}, {'key': '2018-05-11', 'vals': {'Clicks': 365, 'Link Clicks': 379}}, {'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11}}, {'key': '2018-05-11', 'vals': {'Clicks': 1348, 'Link Clicks': 73}}]
out = defaultdict(Counter)  # pass the class itself; defaultdict needs a callable factory
for d in ld:
    out[d['key']].update(d['vals'])

new = OrderedDict(sorted(out.items()))
print(new)
# OrderedDict([('2018-05-10', Counter({'Clicks': 368, 'Link Clicks': 221})), ('2018-05-11', Counter({'Clicks': 1713, 'Link Clicks': 452}))])
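
If the list-of-dicts layout from the question is needed, the per-date Counter objects can be converted back in one pass. A minimal sketch, assuming the keys are ISO dates (YYYY-MM-DD), which sort chronologically as plain strings:

```python
from collections import Counter, defaultdict

ld = [
    {'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}},
    {'key': '2018-05-11', 'vals': {'Clicks': 365, 'Link Clicks': 379}},
    {'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11}},
    {'key': '2018-05-11', 'vals': {'Clicks': 1348, 'Link Clicks': 73}},
]

# One Counter per date; update() adds the incoming values to the running totals.
out = defaultdict(Counter)
for entry in ld:
    out[entry['key']].update(entry['vals'])

# Rebuild the original layout, sorted by date, with plain dicts instead of Counters.
result = [{'key': k, 'vals': dict(v)} for k, v in sorted(out.items())]
print(result)
```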

Answer 5: (score: 1)

Try this solution:

d = [
{'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}},
{'key': '2018-06-01', 'vals': {'Clicks': 365, 'Link Clicks': 379}},

{'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11}},
{'key': '2018-06-01', 'vals': {'Clicks': 1348, 'Link Clicks': 73}},

]

final_dict = {}

for doc in d:
    date = doc['key']

    if date not in final_dict:
        final_dict[date] = {}

        for key in doc['vals']:
            final_dict[date][key] = doc['vals'][key]

    else:

        for key in doc['vals']:
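            # .get(key, 0) also handles a category that first appears in a later entry
            final_dict[date][key] = final_dict[date].get(key, 0) + doc['vals'][key]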


resp_dict = [{date: final_dict[date]} for date in sorted(final_dict)]

print(resp_dict)

Answer 6: (score: 0)

Use a nested defaultdict:

from collections import defaultdict

result = defaultdict(lambda: defaultdict(int))
for entry in d:
  for key, val in entry['vals'].items():
    result[entry['key']][key] += val

It will give you the following result:

{"2018-05-10": {"Clicks": 368, "Link Clicks": 221}, "2018-05-11": {"Clicks": 1713, "Link Clicks": 452}}

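Answer 7: (score: 0)

Use itertools.groupby:

from collections import Counter
from itertools import groupby
from operator import itemgetter

d = [
 {'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}},
 {'key': '2018-05-11', 'vals': {'Clicks': 365, 'Link Clicks': 379}},
 {'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11}},
 {'key': '2018-05-11', 'vals': {'Clicks': 1348, 'Link Clicks': 73}},
]

newdict = {}
# groupby only merges adjacent items, so sort by 'key' first
for dt, group in groupby(sorted(d, key=itemgetter('key')), key=itemgetter('key')):
    total = Counter()
    for entry in group:
        total.update(entry['vals'])  # adds each entry's values to the running totals
    newdict[dt] = dict(total)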

Output:

{'2018-05-10': {'Clicks': 368, 'Link Clicks': 221},
 '2018-05-11': {'Clicks': 1713, 'Link Clicks': 452}}
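
One thing worth keeping in mind with itertools.groupby is that it only groups consecutive equal keys, which is why the input is sorted before grouping. A small sketch of the difference, using just the date strings from the sample data:

```python
from itertools import groupby

keys = ['2018-05-10', '2018-05-11', '2018-05-10', '2018-05-11']

# Without sorting, each run of equal keys becomes its own group.
unsorted_groups = [(k, len(list(g))) for k, g in groupby(keys)]

# After sorting, equal keys are adjacent and collapse into one group each.
sorted_groups = [(k, len(list(g))) for k, g in groupby(sorted(keys))]

print(unsorted_groups)  # four groups of size 1
print(sorted_groups)    # [('2018-05-10', 2), ('2018-05-11', 2)]
```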