I have the following data frame and want to run a pivot table operation on it with pandas:
            cat1          cat2  col_a  col_b
0   (34.0, 38.0]  (15.9, 47.0]     29     10
1   (34.0, 38.0]  (15.9, 47.0]     37     22
2   (28.0, 34.0]  (47.0, 56.0]      3     13
3   (34.0, 38.0]  (47.0, 56.0]     15      7
4   (28.0, 34.0]  (56.0, 67.0]     42     20
5   (28.0, 34.0]  (47.0, 56.0]     31     23
6   (28.0, 34.0]  (56.0, 67.0]     26     17
7   (28.0, 34.0]  (56.0, 67.0]      7      1
8   (28.0, 34.0]  (56.0, 67.0]     36     19
9   (19.0, 28.0]  (56.0, 67.0]      5      7
10  (19.0, 28.0]  (56.0, 67.0]     21      5
11  (28.0, 34.0]  (67.0, 84.0]     37     13
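The call I am making is:
pd.pivot_table(df, index='cat1', columns='cat2', values='col_a')
But I get the error:
TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'
col_a and col_b are of type int32, while cat1 and cat2 are categorical. How can I get rid of this error?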
Answer 0 (score: 4)
This is a bug related to pivoting on columns that hold intervals (see GH25814); it will be fixed in v0.25.
See also the related question that uses crosstab: Pandas crosstab on CategoricalDType columns throws TypeError
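For readers on a pandas release that already includes the fix, the direct pivot on interval categories should work without any workaround. The following is a minimal sketch under that assumption; the sample values and the helper column names x and y are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; x and y are illustrative helper columns.
df = pd.DataFrame({
    'x': [44, 47, 64, 67, 9],
    'y': [62, 85, 88, 15, 65],
    'col_a': [18, 38, 26, 14, 10],
})
bins = np.arange(0, 101, 10)
df['cat1'] = pd.cut(df['x'], bins=bins)  # categorical column of intervals
df['cat2'] = pd.cut(df['y'], bins=bins)

# Assumes a pandas version that contains the fix mentioned above.
# observed=True keeps only the interval categories that occur in the data.
pivot = df.pivot_table(index='cat1', columns='cat2', values='col_a',
                       aggfunc='mean', observed=True)
print(pivot)
```

On a version that still has the bug, this call is expected to raise the TypeError from the question, in which case the workarounds in this answer apply.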
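In the meantime, here are a couple of options. To aggregate, you will have to use pivot_table and convert the categorical columns to strings before pivoting.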
df2 = df.assign(cat1=df['cat1'].astype(str), cat2=df['cat2'].astype(str))
# to aggregate by taking the mean of col_a
df2.pivot_table(index='cat1', columns='cat2', values='col_a', aggfunc='mean')
The caveat here is that you lose the benefit of having intervals as the index and columns.
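If the interval labels are needed again after the string-based pivot, one possible approach is to map each string label back to the Interval it came from. This is only a sketch: restore_intervals is a hypothetical helper, the sample data is made up, and it assumes there are no missing values in the categorical columns (a missing value would become the string 'nan' and fail the lookup).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; x and y are illustrative helper columns.
df = pd.DataFrame({
    'x': [44, 47, 64, 67, 9],
    'y': [62, 85, 88, 15, 65],
    'col_a': [18, 38, 26, 14, 10],
})
bins = np.arange(0, 101, 10)
df['cat1'] = pd.cut(df['x'], bins=bins)
df['cat2'] = pd.cut(df['y'], bins=bins)

# String-based workaround: the pivot labels are plain strings.
df2 = df.assign(cat1=df['cat1'].astype(str), cat2=df['cat2'].astype(str))
pivot = df2.pivot_table(index='cat1', columns='cat2', values='col_a',
                        aggfunc='mean')


def restore_intervals(labels, categories):
    """Hypothetical helper: turn string labels back into an IntervalIndex."""
    lookup = {str(interval): interval for interval in categories}
    return pd.IntervalIndex([lookup[label] for label in labels],
                            name=labels.name)


pivot.index = restore_intervals(pivot.index, df['cat1'].cat.categories)
pivot.columns = restore_intervals(pivot.columns, df['cat2'].cat.categories)

# With an IntervalIndex, a scalar selects the interval that contains it.
print(pivot.loc[45])
```

Note that string labels sort lexicographically, so with other bin edges the rows and columns may not come out in numeric order even after the intervals are restored.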
Another option is to pivot on the categorical codes and then reassign the categories:
df2 = df.assign(cat1=df['cat1'].cat.codes, cat2=df['cat2'].cat.codes)
pivot = df2.pivot_table(
    index='cat1', columns='cat2', values='col_a', aggfunc='mean')
pivot.index = df['cat1'].cat.categories
pivot.columns = df['cat2'].cat.categories
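This assignment works because pivot_table sorts the intervals beforehand, so the rows and columns come out in the same order as the categories. It assumes every category actually occurs in the data; otherwise the lengths will not match.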
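When some categories never occur in the data (for example with fixed bins from pd.cut), the pivot has fewer rows or columns than there are categories, and a plain assignment of all categories would fail with a length mismatch. A hedged variant is to use the codes that survive in the pivot to pick the matching categories. The sample data below is made up, and the sketch assumes no missing values (which would show up as code -1).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; x and y are illustrative helper columns.
df = pd.DataFrame({
    'x': [44, 47, 64, 67, 9],
    'y': [62, 85, 88, 15, 65],
    'col_a': [18, 38, 26, 14, 10],
})
bins = np.arange(0, 101, 10)
df['cat1'] = pd.cut(df['x'], bins=bins)  # 10 categories, only 3 used
df['cat2'] = pd.cut(df['y'], bins=bins)

# Pivot on the integer category codes instead of the intervals.
df2 = df.assign(cat1=df['cat1'].cat.codes, cat2=df['cat2'].cat.codes)
pivot = df2.pivot_table(index='cat1', columns='cat2', values='col_a',
                        aggfunc='mean')

# The pivot labels are the codes that occur; use them as positions
# into the categories so that unused categories are skipped.
pivot.index = (df['cat1'].cat.categories
               .take(pivot.index.to_numpy()).rename('cat1'))
pivot.columns = (df['cat2'].cat.categories
                 .take(pivot.columns.to_numpy()).rename('cat2'))
print(pivot)
```

Because the codes are assigned in category order and the pivot sorts them, the selected intervals line up with the pivot rows and columns.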
Minimal Code Example
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({
    'cat1': np.random.choice(100, 10),
    'cat2': np.random.choice(100, 10),
    'col_a': np.random.randint(1, 50, 10)})
df['cat1'] = pd.cut(df['cat1'], bins=np.arange(0, 101, 10))
df['cat2'] = pd.cut(df['cat2'], bins=np.arange(0, 101, 10))
df
       cat1      cat2  col_a
0  (40, 50]  (60, 70]     18
1  (40, 50]  (80, 90]     38
2  (60, 70]  (80, 90]     26
3  (60, 70]  (10, 20]     14
4  (60, 70]  (50, 60]      9
5   (0, 10]  (60, 70]     10
6  (80, 90]  (30, 40]     21
7  (20, 30]  (80, 90]     17
8  (30, 40]  (40, 50]      6
9  (80, 90]  (80, 90]     16
(df.assign(cat1=df['cat1'].astype(str), cat2=df['cat2'].astype(str))
   .pivot_table(index='cat1', columns='cat2', values='col_a', aggfunc='mean'))
cat2      (10, 20]  (30, 40]  (40, 50]  (50, 60]  (60, 70]  (80, 90]
cat1
(0, 10]        NaN       NaN       NaN       NaN      10.0       NaN
(20, 30]       NaN       NaN       NaN       NaN       NaN      17.0
(30, 40]       NaN       NaN       6.0       NaN       NaN       NaN
(40, 50]       NaN       NaN       NaN       NaN      18.0      38.0
(60, 70]      14.0       NaN       NaN       9.0       NaN      26.0
(80, 90]       NaN      21.0       NaN       NaN       NaN      16.0