Using pandas pivot_table with interval columns raises TypeError

Time: 2019-06-19 12:43:20

Tags: python pandas dataframe pivot-table

      cat1             cat2                       col_a             col_b
0    (34.0, 38.0]    (15.9, 47.0]             29               10
1    (34.0, 38.0]    (15.9, 47.0]             37               22
2    (28.0, 34.0]    (47.0, 56.0]              3               13
3    (34.0, 38.0]    (47.0, 56.0]             15                7
4    (28.0, 34.0]    (56.0, 67.0]             42               20
5    (28.0, 34.0]    (47.0, 56.0]             31               23
6    (28.0, 34.0]    (56.0, 67.0]             26               17
7    (28.0, 34.0]    (56.0, 67.0]              7                1
8    (28.0, 34.0]    (56.0, 67.0]             36               19
9    (19.0, 28.0]    (56.0, 67.0]              5                7
10   (19.0, 28.0]    (56.0, 67.0]             21                5
11   (28.0, 34.0]    (67.0, 84.0]             37               13

In the DataFrame above, I want to build a pivot table with pandas:

pd.pivot_table(df, index='cat1', columns='cat2', values='col_a')

But I get this error:

TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'

col_a and col_b are of type int32, while cat1 and cat2 are categorical. How can I get rid of this error?

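1 answer:

Answer 0 (score: 4)

This is a bug affecting categorical columns whose categories are intervals (see GH25814), and it will be fixed in v0.25. See also this related question about crosstab: Pandas crosstab on CategoricalDType columns throws TypeError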
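To check whether a column falls into the affected case, one can inspect its dtype and categories. The snippet below is only a sketch: the question does not show how cat1 was created, so the use of pd.cut, the `raw` column, and the helper name `has_interval_categories` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data; pd.cut is an assumption about how cat1 was created.
df = pd.DataFrame({'raw': [20.5, 30.0, 36.0], 'col_a': [5, 3, 29]})
df['cat1'] = pd.cut(df['raw'], bins=[19.0, 28.0, 34.0, 38.0])


def has_interval_categories(series):
    """Return True if the series is categorical with Interval categories."""
    return (isinstance(series.dtype, pd.CategoricalDtype)
            and isinstance(series.dtype.categories, pd.IntervalIndex))


print(has_interval_categories(df['cat1']))   # binned column
print(has_interval_categories(df['col_a']))  # plain integer column
```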

In the meantime, here are a few options. To aggregate, you will have to use pivot_table and convert the categorical columns to strings before pivoting.

df2 = df.assign(cat1=df['cat1'].astype(str), cat2=df['cat2'].astype(str))
# to aggregate by taking the mean of col_a
df2.pivot_table(index='cat1', columns='cat2', values='col_a', aggfunc='mean')

The caveat here is that you lose the benefit of having intervals as the index and columns.
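One visible consequence of that caveat is ordering: interval categories sort numerically, while their string labels sort as text. The following is a small illustration on made-up data (the values and bins are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical toy data, not taken from the question.
values = pd.Series([5, 15, 50, 150])
binned = pd.cut(values, bins=[0, 10, 20, 100, 200])

# The interval categories are ordered numerically.
intervals = binned.cat.categories
print(intervals.is_monotonic_increasing)

# Converted to strings, the labels are ordinary text and sort lexicographically.
in_interval_order = list(binned.astype(str))
lexicographic = sorted(in_interval_order)
print(in_interval_order)
print(lexicographic)  # '(100, 200]' now sorts before '(20, 100]'
```

A pivot table built on the string labels would therefore order its rows and columns as text, which can differ from the numeric order of the bins.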

Another option is to pivot on the categorical codes and then reassign the categories:

df2 = df.assign(cat1=df['cat1'].cat.codes, cat2=df['cat2'].cat.codes)
pivot = df2.pivot_table(
    index='cat1', columns='cat2', values='col_a', aggfunc='mean')

pivot.index = df['cat1'].cat.categories
pivot.columns = df['cat2'].cat.categories

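This assignment works because pivot_table sorts its index and columns, so the codes come out in the same order as the interval categories. It assumes every category occurs in the data; otherwise the lengths will not match.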
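If some bins are empty, assigning the full category list raises a length mismatch, because the pivot only contains the codes that actually occur. A possible variant, sketched here on hypothetical data (the `x` and `y` columns and the bin edges are assumptions), maps each surviving code back to its interval with `take`:

```python
import pandas as pd

# Hypothetical data: the bins are chosen so that several intervals stay empty.
df = pd.DataFrame({
    'x': [5, 7, 45, 95],
    'y': [15, 15, 55, 85],
    'col_a': [1, 3, 10, 20],
})
bins = list(range(0, 101, 10))
df['cat1'] = pd.cut(df['x'], bins=bins)
df['cat2'] = pd.cut(df['y'], bins=bins)

# Pivot on the integer codes, as in the answer above.
codes = df.assign(cat1=df['cat1'].cat.codes, cat2=df['cat2'].cat.codes)
pivot = codes.pivot_table(
    index='cat1', columns='cat2', values='col_a', aggfunc='mean')

# Look up only the codes present in the pivot instead of
# assigning every category, so empty bins cause no mismatch.
cat1_intervals = df['cat1'].cat.categories
cat2_intervals = df['cat2'].cat.categories
pivot.index = cat1_intervals.take(pivot.index.to_numpy()).rename('cat1')
pivot.columns = cat2_intervals.take(pivot.columns.to_numpy()).rename('cat2')

print(pivot)
```

This sketch assumes there are no missing values in the binned columns; a missing bin has code -1 and would need to be dropped before the lookup.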


Minimal code example

import pandas as pd
import numpy as np

np.random.seed(0)

df = pd.DataFrame({
    'cat1': np.random.choice(100, 10), 
    'cat2': np.random.choice(100, 10), 
    'col_a': np.random.randint(1, 50, 10)})

df['cat1'] = pd.cut(df['cat1'], bins=np.arange(0, 101, 10))
df['cat2'] = pd.cut(df['cat2'], bins=np.arange(0, 101, 10))

df
       cat1      cat2  col_a
0  (40, 50]  (60, 70]     18
1  (40, 50]  (80, 90]     38
2  (60, 70]  (80, 90]     26
3  (60, 70]  (10, 20]     14
4  (60, 70]  (50, 60]      9
5   (0, 10]  (60, 70]     10
6  (80, 90]  (30, 40]     21
7  (20, 30]  (80, 90]     17
8  (30, 40]  (40, 50]      6
9  (80, 90]  (80, 90]     16

(df.assign(cat1=df['cat1'].astype(str), cat2=df['cat2'].astype(str))
   .pivot_table(index='cat1', columns='cat2', values='col_a', aggfunc='mean'))

cat2      (10, 20]  (30, 40]  (40, 50]  (50, 60]  (60, 70]  (80, 90]
cat1                                                                
(0, 10]        NaN       NaN       NaN       NaN      10.0       NaN
(20, 30]       NaN       NaN       NaN       NaN       NaN      17.0
(30, 40]       NaN       NaN       6.0       NaN       NaN       NaN
(40, 50]       NaN       NaN       NaN       NaN      18.0      38.0
(60, 70]      14.0       NaN       NaN       9.0       NaN      26.0
(80, 90]       NaN      21.0       NaN       NaN       NaN      16.0