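if-else by column dtype in pandas

Time: 2019-10-04 20:16:58

Tags: python pandas

Formatting pandas output

I'm trying to automatically get output from pandas in a format I can use directly, to minimise the fiddling I have to do in a word processor. I'm using descriptive statistics as a practice case, so I'm working with the output of df[variable].describe(). My problem is that what .describe() returns depends on the dtype of the column (if I understand correctly).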

For a numeric column, describe() produces the following output:

count    306.000000
mean      36.823529
std        6.308587
min       10.000000
25%       33.000000
50%       37.000000
75%       41.000000
max       50.000000
Name: gses_tot, dtype: float64

For a categorical column, however, it produces:

count        306
unique         3
top       Female
freq         166
Name: gender, dtype: object

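Because of this difference, I need different code to capture the information I want for each case, but I can't seem to get my code to work on categorical variables.

What I've tried

I've tried a few different versions of this: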

for v in df.columns:
    if df[v].dtype.name == 'category': #i've also tried 'object' here
        c, u, t, f, = df[v].describe()
        print(f'******{str(v)}******')
        print(f'Largest category = {t}')
        print(f'Percentage = {(f/c)*100}%')        
    else:
        c, m, std, mi, tf, f, sf, ma, = df[v].describe()
        print(f'******{str(v)}******')
        print(f'M = {m}')
        print(f'SD = {std}')
        print(f'Range = {float(ma) - float(mi)}')
        print(f'\n')

The code in the else block works fine, but when I reach a categorical column I get the error below:

******age****** #this is the output I want for a numerical column
M = 34.21568627450981
SD = 11.983015946197659
Range = 53.0


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-f077cc105185> in <module>
      6         print(f'Percentage = {(f/c)*100}')
      7     else:
----> 8         c, m, std, mi, tf, f, sf, ma, = df[v].describe()
      9         print(f'******{str(v)}******')
     10         print(f'M = {m}')

ValueError: not enough values to unpack (expected 8, got 4)

What I would like to happen is:

******age****** #this is the output I want for a numerical column
M = 34.21568627450981
SD = 11.983015946197659
Range = 53.0


******gender******
Largest category = female
Percentage = 52.2%


I believe the issue is how I'm setting up the if statement with the dtype. I've rooted around trying to find out how to access the dtype properly, but I can't seem to make it work.

Advice would be much appreciated.

1 answer:

Answer 0 (score: 1)

You can check which fields are included in the describe output and print the corresponding parts:

import pandas as pd

df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']), 'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})

for v in df.columns:
    desc = df[v].describe()
    print(f'******{str(v)}******')
    if 'top' in desc:
        print(f'Largest category = {desc["top"]}')
        print(f'Percentage = {(desc["freq"]/desc["count"])*100:.1f}%')        
    else:
        print(f'M = {desc["mean"]}')
        print(f'SD = {desc["std"]}')
        print(f'Range = {float(desc["max"]) - float(desc["min"])}')
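
If you would rather branch on the dtype itself, as the question originally attempted, one possible sketch is below. It uses pandas.api.types.is_numeric_dtype instead of comparing dtype.name to a string, so object, category and other non-numeric columns all fall into the same branch. The function name summarize and the small example frame are made up for illustration; they are not from the question's data. Note that boolean columns also count as numeric under this check.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize(df):
    """Build the report lines for every column, branching on the column dtype.

    `summarize` is a hypothetical helper name, not part of pandas.
    """
    lines = []
    for name in df.columns:
        col = df[name]
        lines.append(f'******{name}******')
        if is_numeric_dtype(col):
            # Numeric column: mean, sample standard deviation, and range.
            lines.append(f'M = {col.mean()}')
            lines.append(f'SD = {col.std()}')
            lines.append(f'Range = {float(col.max()) - float(col.min())}')
        else:
            # Anything else is treated as categorical: report the most
            # frequent value and its share of the non-missing rows.
            counts = col.value_counts()
            lines.append(f'Largest category = {counts.index[0]}')
            lines.append(f'Percentage = {counts.iloc[0] / col.count() * 100:.1f}%')
    return lines


# Made-up example data, only to show the call.
example = pd.DataFrame({
    'age': [20, 30, 40],
    'gender': ['Female', 'Female', 'Male'],
})
summary = summarize(example)
print('\n'.join(summary))
```

Returning the lines as a list rather than printing them inside the loop makes it easier to write them to a file or paste them into a document afterwards.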