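I am generating some features for a machine learning algorithm, and I want to compute some statistics from a DataFrame, similar to describe().
Here is the sample code: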
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [20, 30, 40]})
print(df)
df_t = df.describe()
print(type(df_t))
print(df_t)
print(df_t.columns)
print(df_t.index)
Output:
A B
0 1.0 20
1 NaN 30
2 3.0 40
<class 'pandas.core.frame.DataFrame'>
A B
count 2.000000 3.0
mean 2.000000 30.0
std 1.414214 10.0
min 1.000000 20.0
25% 1.500000 25.0
50% 2.000000 30.0
75% 2.500000 35.0
max 3.000000 40.0
Index(['A', 'B'], dtype='object')
Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')
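Here are the questions:
How can I reshape the result of describe into a single row whose columns are named A_count, A_mean, ..., B_75%, B_max?
What is the best way to do the same with custom functions instead of describe? For example, I would like to add np.median and np.percentile for 20% and 80%.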
Answer 0 (score: 2)
You can use stack:
In [11]: df_s = df_t.stack()
In [12]: df_s.index = df_s.index.map("_".join)
In [13]: df_s
Out[13]:
count_A 2.000000
count_B 3.000000
mean_A 2.000000
mean_B 30.000000
std_A 1.414214
std_B 10.000000
min_A 1.000000
min_B 20.000000
25%_A 1.500000
25%_B 25.000000
50%_A 2.000000
50%_B 30.000000
75%_A 2.500000
75%_B 35.000000
max_A 3.000000
max_B 40.000000
dtype: float64
Although... it's not clear why you would want to do this (you probably don't).
You can pass the percentiles argument to describe:
In [21]: df.describe(percentiles=[0.2, 0.8])
Out[21]:
A B
count 2.000000 3.0
mean 2.000000 30.0
std 1.414214 10.0
min 1.000000 20.0
20% 1.400000 24.0
50% 2.000000 30.0
80% 2.600000 36.0
max 3.000000 40.0
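The two steps above can be combined to get a single row in the column-first naming the question asks for (A_count, A_mean, ...). This is only a minimal sketch, not part of the original answer: it assumes string column labels, and it uses unstack() rather than stack() so that the column name comes first in each pair. The names s and row are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [20, 30, 40]})

# describe() with custom percentiles; the 50% row is always included
df_t = df.describe(percentiles=[0.2, 0.8])

# unstack() on a plain DataFrame gives a Series indexed by (column, statistic)
s = df_t.unstack()

# join each (column, statistic) pair into one label, e.g. 'A_count'
s.index = ["_".join(pair) for pair in s.index]

# turn the Series into a one-row DataFrame
row = s.to_frame().T
print(row)
```

If the column labels are not strings, convert them first (for example with str()) before joining.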
Answer 1 (score: 0)
A solution to the first question (I'm not sure I'm not reinventing the wheel here):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [20, 30, 40]})
print(df)
df_t = df.describe()
print(type(df_t))
print(df_t)
print(df_t.columns)
print(df_t.index)
# Build names in the same row-major order that reshape() flattens the values
col_names = []
for stat_name in df_t.index:
    for col_name in df_t.columns:
        col_names.append(str(col_name) + '_' + str(stat_name))
print('col_names', col_names)
N = len(col_names)
print('len(col_names)', N)
# Flatten the statistics table into a single row
row = df_t.values.reshape(1, N)
print('row.shape', row.shape)
df_stat = pd.DataFrame(data=row, columns=col_names)
print(df_stat)
Output:
A B
0 1.0 20
1 NaN 30
2 3.0 40
<class 'pandas.core.frame.DataFrame'>
A B
count 2.000000 3.0
mean 2.000000 30.0
std 1.414214 10.0
min 1.000000 20.0
25% 1.500000 25.0
50% 2.000000 30.0
75% 2.500000 35.0
max 3.000000 40.0
Index(['A', 'B'], dtype='object')
Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')
col_names ['A_count', 'B_count', 'A_mean', 'B_mean', 'A_std', 'B_std', 'A_min', 'B_min', 'A_25%', 'B_25%', 'A_50%', 'B_50%', 'A_75%', 'B_75%', 'A_max', 'B_max']
len(col_names) 16
row.shape (1, 16)
A_count B_count A_mean B_mean A_std B_std A_min B_min A_25% \
0 2.0 3.0 2.0 30.0 1.414214 10.0 1.0 20.0 1.5
B_25% A_50% B_50% A_75% B_75% A_max B_max
0 25.0 2.0 30.0 2.5 35.0 3.0 40.0
Another solution to the first question, based on Andy Hayden's answer:
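import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [20, 30, 40]})
print(df)
df_t = df.describe()
print(type(df_t))
print(df_t)
print(df_t.columns)
print(df_t.index)
# stack() gives a Series indexed by (statistic, column)
df_s = df_t.stack()
print(type(df_s))
print(df_s)
print(df_s.shape)
# Reverse each (statistic, column) pair so the column name comes first
df_s.index = df_s.index.map(lambda x: '_'.join(x[::-1]))
print(type(df_s))
print(df_s)
# Convert the Series into a one-row DataFrame
df_s = df_s.to_frame().T
print(type(df_s))
print(df_s)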
Output:
A B
0 1.0 20
1 NaN 30
2 3.0 40
<class 'pandas.core.frame.DataFrame'>
A B
count 2.000000 3.0
mean 2.000000 30.0
std 1.414214 10.0
min 1.000000 20.0
25% 1.500000 25.0
50% 2.000000 30.0
75% 2.500000 35.0
max 3.000000 40.0
Index(['A', 'B'], dtype='object')
Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')
<class 'pandas.core.series.Series'>
count A 2.000000
B 3.000000
mean A 2.000000
B 30.000000
std A 1.414214
B 10.000000
min A 1.000000
B 20.000000
25% A 1.500000
B 25.000000
50% A 2.000000
B 30.000000
75% A 2.500000
B 35.000000
max A 3.000000
B 40.000000
dtype: float64
(16,)
<class 'pandas.core.series.Series'>
A_count 2.000000
B_count 3.000000
A_mean 2.000000
B_mean 30.000000
A_std 1.414214
B_std 10.000000
A_min 1.000000
B_min 20.000000
A_25% 1.500000
B_25% 25.000000
A_50% 2.000000
B_50% 30.000000
A_75% 2.500000
B_75% 35.000000
A_max 3.000000
B_max 40.000000
dtype: float64
<class 'pandas.core.frame.DataFrame'>
A_count B_count A_mean B_mean A_std B_std A_min B_min A_25% \
0 2.0 3.0 2.0 30.0 1.414214 10.0 1.0 20.0 1.5
B_25% A_50% B_50% A_75% B_75% A_max B_max
0 25.0 2.0 30.0 2.5 35.0 3.0 40.0
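As for the second question, I managed to do it as follows (the code is not very pretty). Note that 'min', 'max' and 'sum' are used only as example functions; the original idea was to extend the functionality of describe: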
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [20, 30, 40]})
print(df)

def func(df, func_name):
    # Compute one statistic per column
    if func_name == 'max':
        df_t = df.max(axis=0)
    elif func_name == 'min':
        df_t = df.min(axis=0)
    elif func_name == 'sum':
        df_t = df.sum(axis=0)
    else:
        raise NotImplementedError
    # Turn the resulting Series into a one-row DataFrame
    df_t = df_t.to_frame().T
    print(type(df_t))
    print(df_t)
    # Append the statistic name to each column name, e.g. 'A_min'
    df_t.rename(columns=lambda x: x + '_' + func_name, inplace=True)
    print(type(df_t))
    print(df_t)
    return df_t

func_names = ['min', 'max', 'sum']
df_list = []
for func_name in func_names:
    df_t = func(df, func_name)
    df_list.append(df_t)
# Place the one-row frames side by side
df_stat = pd.concat(df_list, axis=1)
print(df_stat)
Output:
A B
0 1.0 20
1 NaN 30
2 3.0 40
<class 'pandas.core.frame.DataFrame'>
A B
0 1.0 20.0
<class 'pandas.core.frame.DataFrame'>
A_min B_min
0 1.0 20.0
<class 'pandas.core.frame.DataFrame'>
A B
0 3.0 40.0
<class 'pandas.core.frame.DataFrame'>
A_max B_max
0 3.0 40.0
<class 'pandas.core.frame.DataFrame'>
A B
0 4.0 90.0
<class 'pandas.core.frame.DataFrame'>
A_sum B_sum
0 4.0 90.0
A_min B_min A_max B_max A_sum B_sum
0 1.0 20.0 3.0 40.0 4.0 90.0
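A possibly tidier variant of the same idea, which also covers the median and the 20%/80% percentiles from the question, is to keep the statistics in a dictionary and build the row directly. This is only a sketch: the names stat_funcs, p20 and p80 are made up for illustration, and it assumes numeric columns. It uses Series.quantile, which skips NaN by default, instead of np.percentile, which would return NaN for column A.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [20, 30, 40]})

# Hypothetical mapping: statistic name -> function taking a Series
stat_funcs = {
    'min': lambda s: s.min(),
    'max': lambda s: s.max(),
    'sum': lambda s: s.sum(),
    'median': lambda s: s.median(),
    'p20': lambda s: s.quantile(0.2),  # 20th percentile, NaN skipped
    'p80': lambda s: s.quantile(0.8),  # 80th percentile, NaN skipped
}

# One entry per (column, statistic), named like 'A_min', 'A_max', ...
row = {
    f"{col}_{name}": f(df[col])
    for col in df.columns
    for name, f in stat_funcs.items()
}

# A list with a single dict becomes a one-row DataFrame
df_stat = pd.DataFrame([row])
print(df_stat)
```

Adding another statistic then only means adding one entry to the dictionary.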