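I have a list with multiple columns and need to group and sum the rows by two of the columns. Can I do this without using a Pandas DataFrame?
I have a dataset in a list like this: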
User Days Project
Dave 3 Red
Dave 4 Red
Dave 2 Blue
Sue 4 Red
Sue 1 Red
Sue 3 Yellow
Specifically:
[['Dave', 3, 'Red'], ['Dave', 4, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
What I want is to output certain totals on the same row, for example:
User Days Project UserDays ProjectDaysPerUser
Dave 3 Red 9 7
Dave 4 Red 9 7
Dave 2 Blue 9 2
Sue 4 Red 8 5
Sue 1 Red 8 5
Sue 3 Yellow 8 3
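So I am trying to group twice, first by user and then by project, to get "ProjectDaysPerUser". It is this double grouping that has me stumped.
Is there a simple way to do this without creating a Pandas DataFrame?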
Answer 0 (score: 1)
The script below uses groupby and appends the resulting sums to each row of the list.
from itertools import groupby

data = [['Dave', 3, 'Red'], ['Dave', 4, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
new_data, final = [], []
# groupby only merges consecutive rows with equal keys, so this relies on
# data already being ordered by user and, within each user, by project
userDays = [[k, sum(v[1] for v in g)] for k, g in groupby(data, key=lambda x: x[0])]
projuserDays = [[k, sum(v[1] for v in g)] for k, g in groupby(data, key=lambda x: (x[0], x[2]))]
# add userDays and projuserDays to each row (rows are modified in place)
for d in data:
    for u in userDays:
        if d[0] == u[0]:
            d.append(u[1])
            new_data.append(d)
    for p in projuserDays:
        if d[0] == p[0][0] and d[2] == p[0][1]:
            d.append(p[1])
            final.append(d)
print(final)
Result:
[['Dave', 3, 'Red', 9, 7],
['Dave', 4, 'Red', 9, 7],
['Dave', 2, 'Blue', 9, 2],
['Sue', 4, 'Red', 8, 5],
['Sue', 1, 'Red', 8, 5],
['Sue', 3, 'Yellow', 8, 3]]
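One caveat: itertools.groupby only merges adjacent rows with equal keys, so the script above depends on the rows already being ordered. The following is a minimal sketch, assuming the rows may arrive in any order; it sorts a copy before each grouping and stores the totals in dictionaries. The names rows, user_totals, project_totals and result are illustrative, not part of the original answer.

```python
from itertools import groupby
from operator import itemgetter

# Same rows as in the question, deliberately shuffled (illustrative data).
rows = [['Dave', 3, 'Red'], ['Sue', 4, 'Red'], ['Dave', 2, 'Blue'],
        ['Sue', 3, 'Yellow'], ['Dave', 4, 'Red'], ['Sue', 1, 'Red']]

user_key = itemgetter(0)        # -> 'Dave'
project_key = itemgetter(0, 2)  # -> ('Dave', 'Red')

# groupby only merges adjacent equal keys, so sort a copy by each key first.
user_totals = {k: sum(r[1] for r in g)
               for k, g in groupby(sorted(rows, key=user_key), key=user_key)}
project_totals = {k: sum(r[1] for r in g)
                  for k, g in groupby(sorted(rows, key=project_key), key=project_key)}

# Build new rows in the original order without mutating the input.
result = [r + [user_totals[r[0]], project_totals[(r[0], r[2])]] for r in rows]
print(result)
```

Sorting costs O(n log n) per grouping, so for large inputs a single pass with a dictionary is likely the simpler choice.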
Answer 1 (score: 1)
Use a dictionary for better performance:
data = [['Dave', 3, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Dave', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
sum_dict = {}
# first pass: accumulate totals per user and per (user, project) pair
for d in data:
    sum_dict[d[0]] = sum_dict.get(d[0], 0) + d[1]
    sum_dict[(d[0], d[2])] = sum_dict.get((d[0], d[2]), 0) + d[1]
# second pass: append both totals to each row (in place) and print it
for d in data:
    d.append(sum_dict[d[0]])
    d.append(sum_dict[(d[0], d[2])])
    print(d)
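The question asks for the totals to appear on the same row in a table with headers, while print(d) shows raw lists. Below is a minimal sketch of one way to print that layout, assuming left-aligned, space-padded columns are acceptable; format_table is a hypothetical helper, and the rows are the result values already shown in this thread.

```python
header = ['User', 'Days', 'Project', 'UserDays', 'ProjectDaysPerUser']
rows = [['Dave', 3, 'Red', 9, 7], ['Dave', 4, 'Red', 9, 7], ['Dave', 2, 'Blue', 9, 2],
        ['Sue', 4, 'Red', 8, 5], ['Sue', 1, 'Red', 8, 5], ['Sue', 3, 'Yellow', 8, 3]]

def format_table(header, rows):
    """Return the header and rows as left-aligned, space-padded text lines."""
    table = [header] + [[str(cell) for cell in row] for row in rows]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(r[i]) for r in table) for i in range(len(header))]
    return [' '.join(cell.ljust(w) for cell, w in zip(r, widths)).rstrip()
            for r in table]

lines = format_table(header, rows)
print('\n'.join(lines))
```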
Answer 2 (score: 0)
Since you are summing, collections.Counter also solves this nicely:
from collections import Counter
import pprint

data = [['Dave', 3, 'Red'], ['Dave', 4, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
user_days = Counter()
project_user_days = Counter()
for (name, num_days, project) in data:
    user_days[name] += num_days
    project_user_days[(name, project)] += num_days
derived_data = [
    [name, num_days, project, user_days[name], project_user_days[(name, project)]]
    for (name, num_days, project) in data
]
pprint.pprint(derived_data)
# [['Dave', 3, 'Red', 9, 7],
# ['Dave', 4, 'Red', 9, 7],
# ['Dave', 2, 'Blue', 9, 2],
# ['Sue', 4, 'Red', 8, 5],
# ['Sue', 1, 'Red', 8, 5],
# ['Sue', 3, 'Yellow', 8, 3]]
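If more grouping columns are needed later, the same two-pass idea can be wrapped in a small function. This is only a sketch under stated assumptions: append_group_totals and its parameters value_index and key_indexes are hypothetical names introduced here, not part of any library, and the function returns new rows instead of modifying the input.

```python
from collections import defaultdict

def append_group_totals(rows, value_index, key_indexes):
    """Return copies of rows with one extra total column per entry in key_indexes.

    key_indexes is a list of tuples of column positions, e.g. [(0,), (0, 2)].
    """
    totals = [defaultdict(int) for _ in key_indexes]
    # First pass: accumulate the value column under every grouping key.
    for row in rows:
        for total, idx in zip(totals, key_indexes):
            total[tuple(row[i] for i in idx)] += row[value_index]
    # Second pass: look the totals up again for each row, in input order.
    return [
        list(row) + [total[tuple(row[i] for i in idx)]
                     for total, idx in zip(totals, key_indexes)]
        for row in rows
    ]

data = [['Dave', 3, 'Red'], ['Dave', 4, 'Red'], ['Dave', 2, 'Blue'],
        ['Sue', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
# Group by user (column 0), then by user and project (columns 0 and 2).
derived = append_group_totals(data, value_index=1, key_indexes=[(0,), (0, 2)])
print(derived)
```

Both passes are linear in the number of rows, and the input order does not matter.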