Summing over MultiIndex pandas columns

Date: 2017-06-13 06:47:04

Tags: python pandas pandas-groupby

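I want to create a DataFrame in which both the columns (year, quarter, month) and the index (some attributes) are hierarchical, i.e. a MultiIndex. I want to sum over certain levels, for example the sum of all months belonging to a quarter. In pandas this can be done with a one-liner such as: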

# Axis 1 = columns, level 0 = year, level 1 = quarter
df.sum(axis=1, level=[0, 1])

This works until, in some odd cases, the index is no longer recognized correctly, which triggers the error message No axis named 1 for object type <class 'pandas.core.series.Series'>
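
Note for newer pandas versions: the `level` argument of `DataFrame.sum` was deprecated in pandas 1.3 and removed in 2.0, so the one-liner above raises a `TypeError` there regardless of dtype. A minimal sketch of an equivalent that groups the columns through a transpose is shown here; the small frame and its labels are made up for illustration and are not the data from this question.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; labels and values are invented for this sketch.
cols = pd.MultiIndex.from_tuples(
    [(2011, 1, 1), (2011, 1, 2), (2011, 2, 4), (2012, 1, 1)],
    names=["year", "quarter", "month"],
)
df = pd.DataFrame(np.ones((2, 4)), index=["x", "y"], columns=cols)

# Sum the months within each (year, quarter):
# transpose, group the rows by the two outer levels, transpose back.
quarterly = df.T.groupby(level=["year", "quarter"]).sum().T
print(quarterly)
```

Going through the transpose avoids `groupby(..., axis=1)`, which newer pandas releases also deprecate.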

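In the code below I create two identical DataFrames (MultiIndex on both axes) with a single difference: df1 is created empty and filled afterwards, while df2 is filled directly at creation. The sum works with df2 but not with df1. I do not understand what happens behind the scenes. Can someone point me to an explanation of this difference?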

import pandas as pd
import numpy as np

cols = [(y, divmod(m - 1, 3)[0] + 1, m)
        for y in list(range(2011, 2014)) for m in list(range(1, 13))]

inds = [(a, b, c)
        for a in ["a1", "a2"] for b in ["b1", "b2"] for c in ["c1", "c2"]]

df1 = pd.DataFrame(index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"]))

df2 = pd.DataFrame(np.ones(df1.shape),
                   index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"]))

for (col, ind) in [(col, ind) for ind in df1.index.values for col in df1.columns.values]:
    entry = np.random.rand()
    df1.loc[ind, col] = entry
    df2.loc[ind, col] = entry

try:
    df1.sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except:
    print("Sum over df1 did not work...")

try:
    df2.sum(axis=1, level=[0, 1])
    print("Sum over df2 did work")
except:
    print("Sum over df2 did not work...")

PS: I found one hint: the entries in df1 have type float, while the entries in df2 have type np.float64, but that still did not help...

1 answer:

Answer 0 (score: 2)

The problem is that all columns in df1 have dtype object. That usually indicates strings, but here the entries are <class 'float'>:

print (df1.dtypes)
year  quarter  month
2011  1        1        object
               2        object
               3        object
      2        4        object
               5        object
               6        object
      3        7        object
               8        object
               9        object
      4        10       object

print (df2.dtypes)
year  quarter  month
2011  1        1        float64
               2        float64
               3        float64
      2        4        float64
               5        float64
               6        float64
      3        7        float64
               8        float64
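
The dtype difference shown above can be reproduced on a tiny frame. This is a minimal sketch with made-up labels (`r1`, `c1`, and so on), not the frames from the question: a DataFrame created with only an index and columns gets object columns, while one created from a NumPy array gets float64 columns.

```python
import numpy as np
import pandas as pd

rows, cols = ["r1", "r2"], ["c1", "c2"]  # hypothetical labels

# No data given: every column starts out with dtype object (filled with NaN).
empty = pd.DataFrame(index=rows, columns=cols)
dtypes_before = empty.dtypes
print(dtypes_before)

# Data given as a float array: every column is float64.
filled = pd.DataFrame(np.ones((2, 2)), index=rows, columns=cols)
print(filled.dtypes)

# Writing a single float into the object frame stores it as a Python object,
# which is why the question sees <class 'float'> for df1.
empty.loc["r1", "c1"] = 0.5
print(type(empty.loc["r1", "c1"]))
```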

So casting works:

try:
    df1.astype(float).sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except:
    print("Sum over df1 did not work...")

try:
    df2.sum(axis=1, level=[0, 1])
    print("Sum over df2 did work")
except:
    print("Sum over df2 did not work...")
Sum over df1 did work
Sum over df2 did work
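
As a small illustration of the cast, the sketch below fills an object frame cell by cell and then converts it. The frame and its labels are made up. Besides `astype(float)`, which the answer uses, `infer_objects()` is shown as an alternative that lets pandas pick the dtype itself. Which one fits better depends on whether every column is known to be numeric.

```python
import pandas as pd

# Hypothetical 2x2 frame created empty, then filled one cell at a time.
obj = pd.DataFrame(index=["r1", "r2"], columns=["c1", "c2"])
for r in obj.index:
    for c in obj.columns:
        obj.loc[r, c] = 1.5

# Explicit cast: fails loudly if a column holds something non-numeric.
as_float = obj.astype(float)

# Soft conversion: pandas infers a better dtype for object columns.
inferred = obj.infer_objects()

print(as_float.dtypes)
print(inferred.dtypes)
```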

Printing the type of each entry right after assignment shows the difference:

for (col, ind) in [(col, ind) for ind in df1.index.values for col in df1.columns.values]:
    entry = np.random.rand()
    df1.loc[ind, col] = entry
    print (type(df1.loc[ind, col]))
    df2.loc[ind, col] = entry
    print (type(df2.loc[ind, col]))

<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>

The best option is to create the DataFrame from a NumPy array, then everything works:

df1 = pd.DataFrame(data = np.random.rand(len(inds), len(cols)),
                   index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year","quarter","month"]))


try:
    df1.sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except:
    print("Sum over df1 did not work...")
Sum over df1 did work
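
If the frame really has to be created empty and filled afterwards, one possible variation (an assumption on my part, not something the answer above proposes) is to pass `dtype=float` at construction, so the columns start as float64 instead of object. The sketch uses a small made-up frame and the transpose-based group sum, since `sum(level=...)` is not available in pandas 2.0 and later.

```python
import pandas as pd

# Small illustrative column index; values are invented for this sketch.
cols = pd.MultiIndex.from_tuples(
    [(2011, 1, 1), (2011, 1, 2), (2011, 2, 4)],
    names=["year", "quarter", "month"],
)

# dtype=float makes the empty frame float64 (all NaN) rather than object.
frame = pd.DataFrame(index=["r1", "r2"], columns=cols, dtype=float)

# Cell-by-cell filling now keeps the float64 dtype.
for row in frame.index:
    for col in frame.columns:
        frame.loc[row, col] = 1.0

# Sum the months within each (year, quarter).
quarterly = frame.T.groupby(level=["year", "quarter"]).sum().T
print(frame.dtypes)
print(quarterly)
```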