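You were very helpful with my previous question (see the link below). I want to sort an index that has alphanumeric values. I ran this script successfully earlier today, but now it raises an error: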
/Library/Python/2.7/site-packages/pandas/core/groupby.py:4036: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Traceback (most recent call last)
aggfunc={'sum': np.sum}, fill_value=0)
File "/Library/Python/2.7/site-packages/pandas/core/reshape/pivot.py", line 136, in pivot_table
agged = grouped.agg(aggfunc)
File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 4036, in aggregate
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
The traceback points to the pivot:
df = df.pivot_table(index=['customer'], columns=['Duration'],
aggfunc={'sum': np.sum},
fill_value=0)
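The only change I made before this error appeared was to move a calculation into one of the DataFrame's columns instead of running it in the SQL statement.
New calculation: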
df['Duration'] = df['Duration']/30
Old grouping and aggregation:
df = df.pivot_table(index=['customer'], columns=['Duration'],
aggfunc={'sum': np.sum}, fill_value=0)
c = df.columns.levels[1]
c = sorted(ns.natsorted(c), key=lambda x: not x.isdigit())
df = df.reindex_axis(pd.MultiIndex.from_product([df.columns.levels[0], c]), axis=1)
New snippet:
df = df.groupby(['customer', 'Duration']).agg({'sum': np.sum})
c = df.columns.get_level_values(1)
c = sorted(ns.natsorted(c), key=lambda x: not x.isdigit())
df = df.reindex_axis(pd.MultiIndex.from_product([df.columns.levels[0], c]), axis=1)
MultiIndex levels with the new approach:
MultiIndex(levels=[[u'Invoice A', u'Invoice B', u'Invoice C', u'Invoice B'], [u'0', u'1', u'10', u'11', u'2', u'2Y', u'3', u'3Y', u'4', u'4Y', u'5', u'5Y', u'6', u'7', u'8', u'9', u'9Y']], labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]], names=['customer', u'Duration'])
When assigning c = df.columns.get_level_values(1), I get this error message:
IndexError: Too many levels: Index has only 1 level, not 2
Sample input:
customer Duration sum
Invoice A 1 1250
Invoice B 2 2000
Invoice B 3 1200
Invoice C 2 10250
Invoice D 3 20500
Invoice D 5 18900
Invoice E 2Y 5000
Invoice F 1 5000
Invoice F 1Y 12100
I'm not sure why, since both levels and names list two entries.
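The end result should be a DataFrame sorted by customer, with columns sorted by Duration, showing the sum for each Duration. Also, the reason I used pivot in the previous version of the code is that it preserved this output format.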
Am I on the right track?
Answer 0 (score: 1)
Instead of agg, you can use the sum() function and then reshape with unstack:
import natsort as ns
df = df.groupby(['customer', 'Duration'])['sum'].sum().unstack()
c = sorted(ns.natsorted(df.columns), key=lambda x: not x.isdigit())
df = df.reindex(columns=c)
print (df)
Duration 1 2 3 5 1Y 2Y
customer
Invoice A 1250.0 NaN NaN NaN NaN NaN
Invoice B NaN 2000.0 1200.0 NaN NaN NaN
Invoice C NaN 10250.0 NaN NaN NaN NaN
Invoice D NaN NaN 20500.0 18900.0 NaN NaN
Invoice E NaN NaN NaN NaN NaN 5000.0
Invoice F 5000.0 NaN NaN NaN 12100.0 NaN
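For reference, here is a self-contained sketch that rebuilds the sample input from the question and shows why get_level_values(1) fails on the columns: after groupby, both keys sit in the row index, so the columns have only one level. It assumes Duration is stored as strings. The duration_key helper is hypothetical and stands in for natsort so the example runs with pandas alone, and fill_value=0 is passed to unstack only to mirror the fill_value=0 from the original pivot_table call.

```python
import pandas as pd

# Sample input from the question; Duration is assumed to be strings ('1', '2Y', ...).
df = pd.DataFrame({
    'customer': ['Invoice A', 'Invoice B', 'Invoice B', 'Invoice C', 'Invoice D',
                 'Invoice D', 'Invoice E', 'Invoice F', 'Invoice F'],
    'Duration': ['1', '2', '3', '2', '3', '5', '2Y', '1', '1Y'],
    'sum': [1250, 2000, 1200, 10250, 20500, 18900, 5000, 5000, 12100],
})

grouped = df.groupby(['customer', 'Duration']).agg({'sum': 'sum'})
# Both grouping keys become levels of the row index; the columns stay single-level.
index_levels = grouped.index.nlevels
column_levels = grouped.columns.nlevels


def duration_key(label):
    # Hypothetical helper (not part of the original answer): purely numeric
    # labels come first, then suffixed ones such as '2Y', each ordered by
    # its numeric part.
    digits = ''.join(ch for ch in label if ch.isdigit())
    return (not label.isdigit(), int(digits) if digits else 0, label)


# Move Duration from the row index to the columns, filling gaps with 0.
wide = grouped['sum'].unstack(fill_value=0)
wide = wide.reindex(columns=sorted(wide.columns, key=duration_key))
print(wide)
```

If missing combinations should stay as NaN, as in the output above, drop the fill_value argument.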