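I want to create a DataFrame whose columns (year, quarter, month) and index (some attributes) are both hierarchical, i.e. MultiIndexes. I want to aggregate over certain levels, for example sum all months that belong to a quarter. In pandas this can be done with a one-liner such as: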
# Axis 1 = columns, level 0 = year, level 1 = quarter
df.sum(axis=1, level=[0, 1])
This works until, in some odd cases, the index is apparently not recognized correctly, which triggers the error message No axis named 1 for object type <class 'pandas.core.series.Series'>.
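In the code below I create two identical DataFrames (MultiIndex on both axes) with a single difference: df1 is not filled at creation, while df2 is filled directly at creation. The sum works with df2 but not with df1. I don't understand what is happening behind the scenes. Can someone point me to an explanation of this difference?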
import pandas as pd
import numpy as np

# Column tuples (year, quarter, month); the quarter is derived from the month.
cols = [(y, divmod(m - 1, 3)[0] + 1, m)
        for y in list(range(2011, 2014)) for m in list(range(1, 13))]
# Row tuples for the three attribute levels.
inds = [(a, b, c)
        for a in ["a1", "a2"] for b in ["b1", "b2"] for c in ["c1", "c2"]]

# df1 is created empty and filled afterwards.
df1 = pd.DataFrame(index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"]))
# df2 is filled with data at creation.
df2 = pd.DataFrame(np.ones(df1.shape),
                   index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"]))

# Write the same random value into both frames, cell by cell.
for (col, ind) in [(col, ind) for ind in df1.index.values for col in df1.columns.values]:
    entry = np.random.rand()
    df1.loc[ind, col] = entry
    df2.loc[ind, col] = entry

try:
    df1.sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except Exception:
    print("Sum over df1 did not work...")

try:
    df2.sum(axis=1, level=[0, 1])
    print("Sum over df2 did work")
except Exception:
    print("Sum over df2 did not work...")
PS: I found a hint: the entries in df1 are of type float, whereas the entries in df2 are of type np.float64, but that still doesn't help me...
Answer 0 (score: 2)
The problem is that all columns of df1 have dtype object. Object columns usually hold strings, but here they hold Python floats (<class 'float'>):
print(df1.dtypes)
year  quarter  month
2011  1        1        object
               2        object
               3        object
      2        4        object
               5        object
               6        object
      3        7        object
               8        object
               9        object
      4        10       object
print(df2.dtypes)
year  quarter  month
2011  1        1        float64
               2        float64
               3        float64
      2        4        float64
               5        float64
               6        float64
      3        7        float64
               8        float64
So casting to float works:
try:
    df1.astype(float).sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except Exception:
    print("Sum over df1 did not work...")

try:
    df2.sum(axis=1, level=[0, 1])
    print("Sum over df2 did work")
except Exception:
    print("Sum over df2 did not work...")
Sum over df1 did work
Sum over df2 did work
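A side note for newer pandas: the level argument of sum was deprecated in pandas 1.3 and removed in 2.0, so the calls above only run on older versions. The following is a minimal sketch of the same cast followed by a quarterly sum via groupby on the transposed frame, which should not depend on the level argument. The shortened single-level index, the constant fill value 1.0 and the names df_obj and quarterly are assumptions made for the illustration, not part of the question.

```python
import pandas as pd

# Same column layout as in the question: (year, quarter, month).
cols = [(y, (m - 1) // 3 + 1, m) for y in range(2011, 2014) for m in range(1, 13)]
columns = pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"])
index = pd.Index(["a1", "a2"], name="a")  # shortened index, for illustration only

# Created empty and filled cell by cell, so the columns end up with dtype object.
df_obj = pd.DataFrame(index=index, columns=columns)
for ind in df_obj.index:
    for col in df_obj.columns:
        df_obj.loc[ind, col] = 1.0

# Cast to float, then group the transposed frame by the year and quarter levels.
quarterly = df_obj.astype(float).T.groupby(level=["year", "quarter"]).sum().T
print(quarterly)
```

Transposing first means the grouping happens along the row axis, so no axis argument is needed for groupby.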
# Print the type of each entry right after it is assigned.
for (col, ind) in [(col, ind) for ind in df1.index.values for col in df1.columns.values]:
    entry = np.random.rand()
    df1.loc[ind, col] = entry
    print(type(df1.loc[ind, col]))
    df2.loc[ind, col] = entry
    print(type(df2.loc[ind, col]))
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
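The alternating types above follow from the column dtype, and a small frame is enough to see it. The sketch below is only an illustration: the labels r1, r2, x, y and the value 0.5 are made up for the example and do not come from the question.

```python
import numpy as np
import pandas as pd

# A frame created from labels alone holds NaN in object-dtype columns.
empty = pd.DataFrame(index=["r1", "r2"], columns=["x", "y"])

# Cell-by-cell assignment stores the Python float as-is; the dtype stays object.
filled = empty.copy()
for row in filled.index:
    for col in filled.columns:
        filled.loc[row, col] = 0.5
stored_type = type(filled.loc["r1", "x"])

# An explicit cast converts every column to float64.
cast = filled.astype(float)
is_numpy_float = isinstance(cast.loc["r1", "x"], np.floating)

print(empty.dtypes.tolist(), filled.dtypes.tolist(), cast.dtypes.tolist())
print(stored_type, is_numpy_float)
```

So the float versus numpy.float64 difference noticed in the question is a symptom of the object dtype rather than the cause of the error.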
The best approach is to create the DataFrame from a numpy array; then everything works:
df1 = pd.DataFrame(data=np.random.rand(len(inds), len(cols)),
                   index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"]))

try:
    df1.sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except Exception:
    print("Sum over df1 did not work...")
Sum over df1 did work
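If the frame really has to be filled cell by cell, one possible alternative (not part of the original answer) is to pass dtype=float to the constructor, so the placeholder frame starts out as float64 instead of object. A minimal sketch, with a reduced two-level index and column layout chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Reduced, hypothetical layout: two row tuples and two column tuples.
index = pd.MultiIndex.from_tuples([("a1", "b1"), ("a1", "b2")], names=["a", "b"])
columns = pd.MultiIndex.from_tuples([(2011, 1), (2011, 2)], names=["year", "quarter"])

# dtype=float makes the placeholder frame float64 (all NaN) rather than object.
df = pd.DataFrame(index=index, columns=columns, dtype=float)

# Cell-by-cell filling now keeps the float64 dtype.
for ind in df.index:
    for col in df.columns:
        df.loc[ind, col] = np.random.rand()

print(df.dtypes)
```

Filling a large frame one cell at a time is still slow, so building it from an array as shown above remains the simpler choice when the data is available up front.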