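I have a dataset:
data = pd.DataFrame({'id': pd.Series([1, 1, 1, 2, 2, 3, 3, 3]),
                     'var1': pd.Series([1, 2, 3, 4, 5, 6, 7, 8]),
                     'var2': pd.Series([11, 12, 13, 14, 15, 16, 17, 18]),
                     'var3': pd.Series([21, 22, 23, 24, 25, 26, 27, 28])})
I need to compute the cumulative sum of every column (var1, var2, var3), grouped by id. How can I write Python code that produces this output?
Thanks in advance.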
Answer 0: (score: 1)
If I understand correctly, you can use DataFrame.groupby to compute the cumulative sum of the columns, grouped by the 'id' column. Something like:
import pandas as pd
data=pd.DataFrame({'id':[1,1,1,2,2,3,3,3],'var1':[1,2,3,4,5,6,7,8],'var2':[11,12,13,14,15,16,17,18], 'var3':[21,22,23,24,25,26,27,28]})
data.groupby('id').apply(lambda x: x.drop('id', axis=1).cumsum(axis=1).sum())
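If what is wanted is a running total within each id group (one output row per input row), a minimal sketch under that assumption is to call cumsum directly on the grouped columns. The names value_cols, running, and result are only illustrative and do not appear in the question.

```python
import pandas as pd

data = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 3, 3, 3],
    'var1': [1, 2, 3, 4, 5, 6, 7, 8],
    'var2': [11, 12, 13, 14, 15, 16, 17, 18],
    'var3': [21, 22, 23, 24, 25, 26, 27, 28],
})

# Columns to accumulate (illustrative name, not from the question)
value_cols = ['var1', 'var2', 'var3']

# Running total of each column, restarting whenever the id changes
running = data.groupby('id')[value_cols].cumsum()

# Put the id column back next to the running totals
result = pd.concat([data[['id']], running], axis=1)
print(result)
```

The grouped cumsum keeps the original row index, so the result lines up row by row with data. For id 1, for example, var1 becomes 1, 3, 6.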
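Answer 1: (score: 1)
I'm not familiar with the pd object you are using, but as I understand your problem, you have a list of labels (id in your code) and several lists of the same length (var1, var2 and var3 in your code). You want to sum the items that share a label, do this for every label, and return the results.
The following code solves the general problem (assuming your label array is sorted, so that equal labels are adjacent):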
from functools import reduce  # reduce is not a builtin in Python 3
from operator import add

def cumsum(A):
    return reduce(add, A)  # total of all items in A

def cumsumlbl(A, lbl):
    # start index of each label's run; sorted because set order is arbitrary
    idx = sorted(lbl.index(item) for item in set(lbl))
    idx.append(len(lbl))  # end index of the last run
    return [cumsum(A[i:j]) for (i, j) in zip(idx[:-1], idx[1:])]
Or, using a modified version of Markus Jarderot's code from here:
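from functools import reduce  # reduce is not a builtin in Python 3
from operator import add

def cumsum(A):
    return reduce(add, A)  # total of all items in A

def doublet(iterable):
    # yield successive, overlapping pairs: (a, b), (b, c), ...
    iterator = iter(iterable)
    item = next(iterator)
    for nxt in iterator:  # renamed from `next` to avoid shadowing the builtin
        yield (item, nxt)
        item = nxt

def cumsumlbl(A, lbl):
    # start index of each label's run; sorted because set order is arbitrary
    idx = sorted(lbl.index(item) for item in set(lbl))
    idx.append(len(lbl))  # end index of the last run
    dbl = doublet(idx)  # generator for successive, overlapping pairs of indices
    return [cumsum(A[i:j]) for (i, j) in dbl]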
And a test:
if __name__ == '__main__':
    A = [1, 2, 3, 4, 5, 6]
    lbl = [1, 1, 2, 2, 2, 3]
    print(cumsumlbl(A, lbl))
Output:
[3, 12, 6]
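The functions above return one total per label. If a running total per label is what the question is after, one possible sketch is below; it is not part of the original answer. It keeps a per-label accumulator in a dictionary, so it does not rely on the labels being sorted. The name running_sum_by_label is hypothetical.

```python
from collections import defaultdict

def running_sum_by_label(values, labels):
    """Return the running total of `values`, kept separately for each label."""
    totals = defaultdict(int)  # running total seen so far for each label
    out = []
    for value, label in zip(values, labels):
        totals[label] += value
        out.append(totals[label])
    return out

if __name__ == '__main__':
    A = [1, 2, 3, 4, 5, 6]
    lbl = [1, 1, 2, 2, 2, 3]
    print(running_sum_by_label(A, lbl))  # [1, 3, 3, 7, 12, 6]
```

The last running value for each label (3, 12 and 6) matches the per-label totals shown above.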