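I created a pandas DataFrame and can compute the standard deviation of one or more of its columns (column level). What I need is the standard deviation over all rows of one specific column. Below are the commands I have tried so far: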
inp_df.std() ### Will determine standard deviation of all the numerical columns by default
salary 8.194421e-01
num_months 3.690081e+05
no_of_hours 2.518869e+02
inp_df.std(axis = 0) ### Same as above command. Performs standard deviation column level
inp_df[['salary']].std() ### Determines Standard Deviation over only the salary column of the dataframe
salary 8.194421e-01
inp_df.std(axis=1) ### Determines Standard Deviation for every row present in the dataframe. But it does this for the entire row and it will output values in a single column. One std value for each row
0 4.374107e+12
1 4.377543e+12
2 4.374026e+12
3 4.374046e+12
4 4.374112e+12
5 4.373926e+12
.
.
.
When I run the following command, every record shows "NaN". Is there any way to fix this?
inp_df[['salary']].std(axis = 1) ### Trying to determine standard deviation only for "salary" column at the row level
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
.
.
.
.
.
Answer 0 (score: 3)
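This is expected; see the documentation for DataFrame.std:
Normalized by N-1 by default. This can be changed using the ddof argument.
With a single element you are dividing by 0. So if you have only one column and ask for the sample standard deviation across columns, you get all missing values.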
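To make the division by zero concrete, here is a minimal pure-Python sketch of a standard deviation with an N - ddof divisor. The helper name std_with_ddof is hypothetical, and this only mimics the documented behaviour; it is not how pandas implements it internally.

```python
import math

def std_with_ddof(values, ddof=1):
    """Standard deviation with divisor N - ddof (hypothetical helper).

    Returns NaN when N - ddof is not positive, mirroring the NaN
    that pandas reports for a single value with the default ddof=1.
    """
    n = len(values)
    if n - ddof <= 0:
        return math.nan
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - ddof))

print(std_with_ddof([10]))           # one value, divisor 1 - 1 = 0
print(std_with_ddof([10], ddof=0))   # one value, divisor 1
print(std_with_ddof([10, 20, 30]))   # three values, divisor 2
```

A row of a one-column DataFrame is exactly the first case: one value, so N - 1 = 0 and the result is undefined.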
Sample:
import pandas as pd

inp_df = pd.DataFrame({'salary':[10,20,30],
                       'num_months':[1,2,3],
                       'no_of_hours':[2,5,6]})
print (inp_df)
salary num_months no_of_hours
0 10 1 2
1 20 2 5
2 30 3 6
Select one column with single [] to get a Series:
print (inp_df['salary'])
0 10
1 20
2 30
Name: salary, dtype: int64
Get the std of the Series, which returns a scalar:
print (inp_df['salary'].std())
10.0
Select one column with double [] to get a one-column DataFrame:
print (inp_df[['salary']])
salary
0 10
1 20
2 30
Get the std of the DataFrame per index (axis=0, the default), which returns a one-element Series:
print (inp_df[['salary']].std())
#same as
#print (inp_df[['salary']].std(axis=0))
salary 10.0
dtype: float64
Get the std of the DataFrame per columns (axis=1), which returns all NaNs:
print (inp_df[['salary']].std(axis = 1))
0 NaN
1 NaN
2 NaN
dtype: float64
If you change the default ddof=1 to ddof=0:
print (inp_df[['salary']].std(axis = 1, ddof=0))
0 0.0
1 0.0
2 0.0
dtype: float64
If you want std over 2 or more columns:
#select 2 columns
print (inp_df[['salary', 'num_months']])
salary num_months
0 10 1
1 20 2
2 30 3
#std by index
print (inp_df[['salary','num_months']].std())
salary 10.0
num_months 1.0
dtype: float64
#std by columns
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0 5.656854
1 10.606602
2 16.970563
dtype: float64
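As a follow-up sketch, the same row-wise calculation can be run with both divisors side by side to see how much ddof matters when each row has only two values. It rebuilds the sample inp_df from above; the variable names cols, sample_std and population_std are only illustrative.

```python
import pandas as pd

inp_df = pd.DataFrame({'salary': [10, 20, 30],
                       'num_months': [1, 2, 3],
                       'no_of_hours': [2, 5, 6]})

cols = ['salary', 'no_of_hours']

# Row-wise sample std: divisor is N - 1 = 1 for two columns
sample_std = inp_df[cols].std(axis=1)

# Row-wise population std: divisor is N = 2 for two columns
population_std = inp_df[cols].std(axis=1, ddof=0)

print(pd.DataFrame({'ddof=1': sample_std, 'ddof=0': population_std}))
```

For the first row (10 and 2) the population value is 4.0, while the sample value is about 5.657, so with very few columns per row the choice of ddof changes the result noticeably.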