I want to calculate the sum for each sub-level of a MultiIndex and then store it in the dataframe.
My current dataframe looks like this:
                values
first second
bar   one     0.106521
      two     1.964873
baz   one     1.289683
      two    -0.696361
foo   one    -0.309505
      two     2.890406
qux   one    -0.758369
      two     1.302628
The desired result is:
                values
first second
bar   one     0.106521
      two     1.964873
      total   2.071394
baz   one     1.289683
      two    -0.696361
      total   0.593322
foo   one    -0.309505
      two     2.890406
      total   2.580901
qux   one    -0.758369
      two     1.302628
      total   0.544259
total one     0.328331
      two     5.461546
      total   5.789877
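So far I have found that the implementation below works, but I would like to know whether there is a better option. I need the fastest possible solution, because in some cases, when my dataframe becomes huge, the computation takes a very long time.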
# imports assumed by the session below
from numpy.random import randn
from pandas import DataFrame, MultiIndex, Series

In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ...:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ...:
In [2]: tuples = list(zip(*arrays))
In [3]: index = MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [4]: s = Series(randn(8), index=index)
In [5]: d = {'values': s}
In [6]: df = DataFrame(d)
In [7]: for col in df.index.names:
   ...:     df = df.unstack(col)                      # move this level into the columns
   ...:     df[('values', 'total')] = df.sum(axis=1)  # add a 'total' column for it
   ...:     df = df.stack()                           # move it back into the index
   ...:
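Answer 0 (score: 1)
Not sure whether you are still looking for an answer. Assuming your current dataframe is assigned to df, you could try something like this: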
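# 'first' and 'second' are index levels, so turn them into columns before pivoting
temp = df.reset_index().pivot(index='first', columns='second', values='values')
temp['total'] = temp['one'] + temp['two']
temp.stack()  # returns a Series indexed by (first, second); no grand-total row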
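A related sketch, not part of the answer above: `pivot_table` with `margins=True` can add both the per-group totals and the grand-total row in a single call, which gives the layout asked for in the question. The sample values are copied from the question; the names `wide` and `result` are just illustrative. Whether this is faster than the unstack/stack loop on a large frame is not something this sketch measures.

```python
import pandas as pd

# Sample data copied from the question.
index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(
    {"values": [0.106521, 1.964873, 1.289683, -0.696361,
                -0.309505, 2.890406, -0.758369, 1.302628]},
    index=index,
)

# margins=True appends a row and a column of sums; margins_name labels them.
wide = df.reset_index().pivot_table(
    index="first", columns="second", values="values",
    aggfunc="sum", margins=True, margins_name="total",
)

# Back to long form with the original two-level (first, second) index.
result = wide.stack().to_frame("values")
print(result)
```

Because the margins are appended last, each `total` row lands after `one` and `two`, and the `total` group lands after `qux`, without any extra sorting.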
Answer 1 (score: 0)
Rather ugly code:
In [162]:
print(df)
                values
first second
bar   one     0.370291
      two     0.750565
baz   one     0.148405
      two     0.919973
foo   one     0.121964
      two     0.394017
qux   one     0.883136
      two     0.871792
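In [163]:
# sum only 'values'; after concat the totals rows have NaN in 'second'
totals = df.reset_index().groupby('first', as_index=False)['values'].sum()
print(pd.concat((df.reset_index(), totals))
        .sort_values(['first', 'second'])  # NaN sorts last within each 'first' group
        .fillna({'second': 'total'})
        .set_index(['first', 'second']))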
                values
first second
bar   one     0.370291
      two     0.750565
      total   1.120856
baz   one     0.148405
      two     0.919973
      total   1.068378
foo   one     0.121964
      two     0.394017
      total   0.515981
qux   one     0.883136
      two     0.871792
      total   1.754927
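Basically, since the additional 'total' rows need to be calculated and inserted into the original dataframe, the relationship between the original and the result is neither one-to-one nor many-to-one. So I think you have to generate the 'total' dataframe separately and concat it with the original dataframe.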
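A minimal sketch of that idea, using a small made-up frame rather than the data above: the totals are built as a separate frame whose index has the same two levels, appended with `pd.concat`, and then a stable sort on `first` places each total after the rows of its group. The variable names are illustrative only.

```python
import pandas as pd

# Small illustrative frame with the same (first, second) index layout.
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame({"values": [1.0, 2.0, 3.0, 4.0]}, index=index)

# Build the totals on their own: one row per value of "first",
# labelled "total" on the second level.
totals = df.groupby(level="first").sum()
totals.index = pd.MultiIndex.from_arrays(
    [totals.index, ["total"] * len(totals)], names=df.index.names
)

# Append the totals, then sort on "first" only. A stable sort keeps the
# concat order within each group, so every total follows its own rows.
result = (
    pd.concat([df, totals])
    .reset_index()
    .sort_values("first", kind="stable")
    .set_index(list(df.index.names))
)
print(result)
```

Sorting on `first` alone avoids the alphabetical trap where 'total' would otherwise sort between 'one' and 'two'.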
Answer 2 (score: 0)
I know this is an old topic, but I could not find any satisfactory way to do a rollup in pandas, and I can actually see some value in it.
import numpy as np
import pandas as pd

# to retain the original index:
index_cols = list(df.index.names)
parts = []
# iterate over each sub-index except the last one to get the sub-sums
for i in range(-1, len(index_cols) - 1):
    if i >= 0:
        # df.sum(level=...) was removed in pandas 2.0; groupby(level=...).sum() is the equivalent
        parts.append(df.groupby(level=list(range(i + 1))).sum().reset_index())
    else:  # -1 handles the grand total
        parts.append(df.sum().to_frame().T)
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concat once
df2 = pd.concat(parts, ignore_index=True)
# mask the last index level, for which no sub-sum was calculated:
df2[index_cols[-1]] = np.nan
# could be done better: keep it as NaN (comment out the line below), which forces it
# to the last position in the index after sorting, or prefix it with a special character
df2[index_cols] = df2[index_cols].astype(object).fillna("_total")
df = pd.concat([df.reset_index(), df2], sort=True).set_index(index_cols).sort_values(index_cols, ascending=False)
For my sample data:
               values
first  second
qux    two       -4.0
       one        2.0
       _total    -2.0
foo    two       -3.0
       one        4.0
       _total     1.0
baz    two        5.0
       one       -1.0
       _total     4.0
bar    two       -1.0
       one        2.0
       _total     1.0
_total _total     4.0