Python - Computing the standard deviation of a DataFrame column (row level)

Time: 2018-12-17 05:31:08

Tags: python-3.x pandas

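I have created a pandas DataFrame and can compute the standard deviation of one or more of its columns (column level). I need to determine the standard deviation over all rows of a specific column. Below are the commands I have tried so far.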

inp_df.std() ### Will determine standard deviation of all the numerical columns by default

salary         8.194421e-01
num_months     3.690081e+05
no_of_hours    2.518869e+02

inp_df.std(axis = 0) ### Same as above command. Performs standard deviation column level

inp_df[['salary']].std() ### Determines Standard Deviation over only the salary column of the dataframe
salary         8.194421e-01

inp_df.std(axis=1) ### Determines Standard Deviation for every row present in the dataframe. But it does this for the entire row and it will output values in a single column. One std value for each row

0       4.374107e+12
1       4.377543e+12
2       4.374026e+12
3       4.374046e+12
4       4.374112e+12
5       4.373926e+12
.
.
.

When I run the following command, every record shows "NaN". Is there any way to fix this?

inp_df[['salary']].std(axis = 1) ### Trying to determine standard deviation only for "salary" column at the row level

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
.
.
.
.
.

1 answer:

Answer 0 (score: 3)

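This is expected. The documentation for DataFrame.std says:

  Normalized by N-1 by default. This can be changed using the ddof argument.

With a single element, N-1 is 0, so you are dividing by 0. If you select one column and ask for the sample standard deviation across columns (axis=1), each row contains only one value and you get all missing values.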
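To make the N-1 normalization concrete, here is a minimal sketch of the calculation in plain Python. The helper name `std_with_ddof` is hypothetical and this is not pandas' actual implementation; it only mirrors the formula quoted above and returns NaN when the divisor N - ddof is not positive.

```python
import math

def std_with_ddof(values, ddof=1):
    """Standard deviation normalized by N - ddof (hypothetical helper)."""
    n = len(values)
    if n - ddof <= 0:
        # The divisor is zero (or negative), so the result is undefined.
        return float("nan")
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))

single_sample = std_with_ddof([10])              # N=1, divisor N-1=0 -> nan
single_population = std_with_ddof([10], ddof=0)  # N=1, divisor N=1   -> 0.0
three_sample = std_with_ddof([10, 20, 30])       # N=3, divisor N-1=2 -> 10.0

print(single_sample, single_population, three_sample)
```

A row of a one-column DataFrame is the N=1 case, which is why every row comes back as NaN with the default ddof=1.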

Sample:

import pandas as pd

inp_df = pd.DataFrame({'salary':[10,20,30],
                       'num_months':[1,2,3],
                       'no_of_hours':[2,5,6]})
print (inp_df)
   salary  num_months  no_of_hours
0      10           1            2
1      20           2            5
2      30           3            6

Select one column with single [] to get a Series:

print (inp_df['salary'])
0    10
1    20
2    30
Name: salary, dtype: int64

Get std of the Series, which returns a scalar:

print (inp_df['salary'].std())
10.0

Select one column with double [] to get a one-column DataFrame:

print (inp_df[['salary']])
   salary
0      10
1      20
2      30

Get std of the DataFrame per index (the default, axis=0), which returns a one-element Series:

print (inp_df[['salary']].std())
#same as
#print (inp_df[['salary']].std(axis=0))
salary    10.0
dtype: float64

Get std of the DataFrame per columns (axis=1), which returns all NaN:

print (inp_df[['salary']].std(axis = 1))
0   NaN
1   NaN
2   NaN
dtype: float64

If you change the default ddof=1 to ddof=0:

print (inp_df[['salary']].std(axis = 1, ddof=0))
0    0.0
1    0.0
2    0.0
dtype: float64
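As a side note on ddof, the short sketch below compares the pandas default (ddof=1, sample standard deviation) with the NumPy default (ddof=0, population standard deviation) on the same salary values. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

salary = pd.Series([10, 20, 30], name="salary")

pandas_sample = salary.std()                      # pandas default: ddof=1
pandas_population = salary.std(ddof=0)            # divide by N instead of N-1
numpy_default = np.std(salary.to_numpy())         # NumPy default: ddof=0
numpy_sample = np.std(salary.to_numpy(), ddof=1)  # matches the pandas default

print(pandas_sample, pandas_population, numpy_default, numpy_sample)
```

With ddof=0 a single value always has a standard deviation of 0, so the zeros above avoid the NaN but carry no information about the spread of salary.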

If you want std over 2 or more columns:

#select 2 columns
print (inp_df[['salary', 'num_months']])
   salary  num_months
0      10           1
1      20           2
2      30           3

#std by index
print (inp_df[['salary','num_months']].std())
salary        10.0
num_months     1.0
dtype: float64

#std by columns (note: this call uses 'salary' and 'no_of_hours')
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0     5.656854
1    10.606602
2    16.970563
dtype: float64
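Putting the pieces together, the sketch below assumes the goal is either one standard deviation for the whole salary column, or a per-row standard deviation across two or more chosen columns. The column name `row_std` and the variable names are hypothetical, introduced only for illustration.

```python
import pandas as pd

inp_df = pd.DataFrame({'salary': [10, 20, 30],
                       'num_months': [1, 2, 3],
                       'no_of_hours': [2, 5, 6]})

# One number for the whole column: std over all rows of 'salary'.
salary_std = inp_df['salary'].std()

# One number per row: std across the chosen columns (needs at least 2 columns
# with the default ddof=1, otherwise every row is NaN).
cols = ['salary', 'no_of_hours']
result = inp_df.assign(row_std=inp_df[cols].std(axis=1))

print(salary_std)
print(result)
```

Using assign returns a new DataFrame and leaves inp_df unchanged, which keeps the original data intact while experimenting.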