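I'm trying to automatically get output from pandas in a format I can use, to minimize the fiddling I have to do in a word processor. I'm using descriptive statistics as a practice case, so I'm trying to use the output of df[variable].describe(). My problem is that the result of .describe() depends on the dtype of the column (if I understand correctly).
For a numeric column, describe() produces the following output: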
count    306.000000
mean      36.823529
std        6.308587
min       10.000000
25%       33.000000
50%       37.000000
75%       41.000000
max       50.000000
Name: gses_tot, dtype: float64
For a categorical column, however, it produces:
count        306
unique         3
top       Female
freq         166
Name: gender, dtype: object
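Because of this difference, I need different code to capture the information I want. However, I can't seem to get my code to work on categorical variables.
I've tried a few different versions: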
for v in df.columns:
    if df[v].dtype.name == 'category': #i've also tried 'object' here
        c, u, t, f, = df[v].describe()
        print(f'******{str(v)}******')
        print(f'Largest category = {t}')
        print(f'Percentage = {(f/c)*100}%')
    else:
        c, m, std, mi, tf, f, sf, ma, = df[v].describe()
        print(f'******{str(v)}******')
        print(f'M = {m}')
        print(f'SD = {std}')
        print(f'Range = {float(ma) - float(mi)}')
    print(f'\n')
The code in the else block works fine, but when I reach a categorical column I get the error below:
******age****** #this is the output I want for a numerical column
M = 34.21568627450981
SD = 11.983015946197659
Range = 53.0
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-f077cc105185> in <module>
      6         print(f'Percentage = {(f/c)*100}')
      7     else:
----> 8         c, m, std, mi, tf, f, sf, ma, = df[v].describe()
      9         print(f'******{str(v)}******')
     10         print(f'M = {m}')
ValueError: not enough values to unpack (expected 8, got 4)
What I want to happen is:
******age****** #this is the output I want for a numerical column
M = 34.21568627450981
SD = 11.983015946197659
Range = 53.0
******gender******
Largest category = female
Percentage = 52.2%
I believe the issue is how I'm setting up the if statement with the dtype. I've rooted around trying to find out how to access the dtype properly, but I can't seem to make it work.
Any advice would be much appreciated.
Answer 0 (score: 1)
You can check which fields the describe output includes and print the corresponding parts:
import pandas as pd

df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']), 'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})
for v in df.columns:
    desc = df[v].describe()
    print(f'******{str(v)}******')
    # 'top' is only in the describe() index for non-numeric (object/categorical) columns
    if 'top' in desc:
        print(f'Largest category = {desc["top"]}')
        print(f'Percentage = {(desc["freq"]/desc["count"])*100:.1f}%')
    else:
        print(f'M = {desc["mean"]}')
        print(f'SD = {desc["std"]}')
        print(f'Range = {float(desc["max"]) - float(desc["min"])}')
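If you would rather branch on the dtype itself, as the question originally attempted, one possible sketch is below. It relies on pandas.api.types.is_numeric_dtype and is_bool_dtype instead of comparing dtype.name to a string, so both 'object' and 'category' columns take the same branch. The helper name summarize and the sample data are made up for illustration, and the sketch assumes the frame holds only numeric and object/categorical columns (datetime columns, for example, are not handled).

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def summarize(series: pd.Series) -> str:
    """Hypothetical helper: build a short text summary of one column."""
    desc = series.describe()
    lines = [f'******{series.name}******']
    # Boolean columns count as numeric for is_numeric_dtype, but describe()
    # reports them with count/unique/top/freq, so they are excluded here.
    if is_numeric_dtype(series) and not is_bool_dtype(series):
        lines.append(f'M = {desc["mean"]:.2f}')
        lines.append(f'SD = {desc["std"]:.2f}')
        lines.append(f'Range = {desc["max"] - desc["min"]}')
    else:
        lines.append(f'Largest category = {desc["top"]}')
        lines.append(f'Percentage = {desc["freq"] / desc["count"] * 100:.1f}%')
    return '\n'.join(lines)


# Illustrative data only, not the asker's data set.
df = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female', 'Other'],
    'age': [20, 30, 40, 50],
})
for col in df.columns:
    print(summarize(df[col]))
```

Returning a string rather than printing inside the helper makes it easier to write the summaries to a file or paste them into a document later.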