Python / Pandas Dataframe: replace 0 with the median value

Time: 2016-05-29 05:21:27

Tags: python pandas dataframe mean median

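I have a Python pandas DataFrame with multiple columns, and one column contains 0 values. I want to replace the 0 values with the median or mean of that column.

data is my DataFrame
artist_hotness is the column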

mean_artist_hotness = data['artist_hotness'].dropna().mean()

if len(data.artist_hotness[ data.artist_hotness.isnull() ]) > 0:
    data.artist_hotness.loc[ (data.artist_hotness.isnull()), 'artist_hotness'] = mean_artist_hotness

I tried this, but it did not work.

4 answers:

Answer 0 (score: 9)

Use the pandas replace method:

df = pd.DataFrame({'a': [1,2,3,4,0,0,0,0], 'b': [2,3,4,6,0,5,3,8]})

df
   a  b
0  1  2
1  2  3
2  3  4
3  4  6
4  0  0
5  0  5
6  0  3
7  0  8

df['a']=df['a'].replace(0,df['a'].mean())

df
      a  b
0  1.00  2
1  2.00  3
2  3.00  4
3  4.00  6
4  1.25  0
5  1.25  5
6  1.25  3
7  1.25  8
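Because the question also asks about the median, the same call works with median() in place of mean(). One caveat: both statistics are computed over every row, zeros included, so for column a the mean is 10 / 8 = 1.25 and the median is 0.5. The sketch below assumes the zeros should be left out of the statistic; the names median_all and median_nonzero are illustrative and not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 0, 0, 0, 0], 'b': [2, 3, 4, 6, 0, 5, 3, 8]})

# Median over all rows, zeros included: the sorted values are 0,0,0,0,1,2,3,4
median_all = df['a'].median()

# Assumption: zeros mark missing data, so leave them out of the statistic
median_nonzero = df.loc[df['a'] != 0, 'a'].median()

# Only column a is touched; the 0 in column b stays as it is
df['a'] = df['a'].replace(0, median_nonzero)
print(df)
```

Whether the zeros belong in the statistic depends on what a 0 means in the data, so this is a choice to make deliberately rather than a rule.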


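Answer 1 (score: 4)

I think you can use mask, and pass skipna=True to mean instead of calling dropna first. You also need to change the condition: use data.artist_hotness == 0 if you need to replace 0 values, or data.artist_hotness.isnull() if you need to replace NaN values.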

import pandas as pd
import numpy as np

data = pd.DataFrame({'artist_hotness': [0,1,5,np.nan]})
print (data)
   artist_hotness
0             0.0
1             1.0
2             5.0
3             NaN

mean_artist_hotness = data['artist_hotness'].mean(skipna=True)
print (mean_artist_hotness)
2.0

data['artist_hotness']=data.artist_hotness.mask(data.artist_hotness == 0,mean_artist_hotness)
print (data)
   artist_hotness
0             2.0
1             1.0
2             5.0
3             NaN
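For readers who know Series.where better than Series.mask, the two are mirror images: mask replaces values where the condition is True, while where replaces them where it is False. A small sketch on the same data (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 5, np.nan], name='artist_hotness')
fill_value = s.mean()  # NaN is skipped by default, so this is (0 + 1 + 5) / 3

via_mask = s.mask(s == 0, fill_value)    # replace where the condition holds
via_where = s.where(s != 0, fill_value)  # keep where the condition holds

# NaN != 0 evaluates to True, so the NaN row is left unchanged by both calls
print(via_mask.equals(via_where))
```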

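Alternatively, use loc on the DataFrame rather than on the column, and pass the column name as the second indexer: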

data.loc[data.artist_hotness == 0, 'artist_hotness'] = mean_artist_hotness
print (data)
   artist_hotness
0             2.0
1             1.0
2             5.0
3             NaN

Calling loc on the column itself and also passing the column name fails:

data.artist_hotness.loc[data.artist_hotness == 0, 'artist_hotness'] = mean_artist_hotness
print (data)

IndexingError: (0     True
1    False
2    False
3    False
Name: artist_hotness, dtype: bool, 'artist_hotness')
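The error arises because a Series has only one axis, so Series.loc accepts a single indexer; the row-and-column form belongs to DataFrame.loc. A minimal sketch of the two valid spellings follows. Using a separate Series named hotness for the second form is my own choice, made to avoid chained assignment on the DataFrame.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'artist_hotness': [0, 1, 5, np.nan]})
mean_artist_hotness = data['artist_hotness'].mean()

# DataFrame.loc: row condition first, column label second
data.loc[data['artist_hotness'] == 0, 'artist_hotness'] = mean_artist_hotness

# Series.loc: one indexer only, shown on an independent Series
hotness = pd.Series([0, 1, 5, np.nan], name='artist_hotness')
hotness.loc[hotness == 0] = mean_artist_hotness

print(data['artist_hotness'].equals(hotness))
```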

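Another solution is DataFrame.replace with the column specified (the output below is for a DataFrame that also has a second column, aa, as defined in the next example):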

data=data.replace({'artist_hotness': {0: mean_artist_hotness}}) 
print (data)
    aa  artist_hotness
0  0.0             2.0
1  1.0             1.0
2  5.0             5.0
3  NaN             NaN 

Or, if you need to replace all 0 values in all columns:

import pandas as pd
import numpy as np

data = pd.DataFrame({'artist_hotness': [0,1,5,np.nan], 'aa': [0,1,5,np.nan]})
print (data)
    aa  artist_hotness
0  0.0             0.0
1  1.0             1.0
2  5.0             5.0
3  NaN             NaN

mean_artist_hotness = data['artist_hotness'].mean(skipna=True)
print (mean_artist_hotness)
2.0

data=data.replace(0,mean_artist_hotness) 
print (data)
    aa  artist_hotness
0  2.0             2.0
1  1.0             1.0
2  5.0             5.0
3  NaN             NaN
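The call above writes the mean of artist_hotness into every column. If each column should instead receive its own mean, one possible sketch is to build the nested dictionary that DataFrame.replace accepts, with one entry per column. The sample values for aa are changed here so that the two means differ; that is an assumption made for illustration only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'artist_hotness': [0, 1, 5, np.nan], 'aa': [0, 3, 9, np.nan]})

# {column: {old_value: new_value}} limits each replacement to its own column
per_column = {col: {0: data[col].mean()} for col in data.columns}

data = data.replace(per_column)
print(data)
```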

If you need to replace NaN in all columns, use DataFrame.fillna:

data=data.fillna(mean_artist_hotness) 
print (data)
    aa  artist_hotness
0  0.0             0.0
1  1.0             1.0
2  5.0             5.0
3  2.0             2.0

But if you need it only in some columns, use Series.fillna:

data['artist_hotness'] = data.artist_hotness.fillna(mean_artist_hotness) 
print (data)
    aa  artist_hotness
0  0.0             0.0
1  1.0             1.0
2  5.0             5.0
3  NaN             2.0
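DataFrame.fillna also accepts a Series or dict keyed by column name, which fills each column with its own value in a single call. A sketch, with made-up values for aa so that the two column means differ:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'artist_hotness': [0, 1, 5, np.nan], 'aa': [0, 3, 9, np.nan]})

# data.mean() returns one mean per column; fillna matches them by column name
filled = data.fillna(data.mean())
print(filled)
```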

Answer 2 (score: 1)

mean_artist_hotness = data['artist_hotness'].mean()  # compute once, not once per row
data['artist_hotness'] = data['artist_hotness'].map(lambda x: mean_artist_hotness if x == 0 else x)
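Series.map calls the lambda once per element in Python. A vectorised alternative, sketched here with numpy.where (which is not part of the answer), should give the same result:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'artist_hotness': [0, 1, 5, np.nan]})
mean_artist_hotness = data['artist_hotness'].mean()

# Element-wise version, as in the answer
mapped = data['artist_hotness'].map(lambda x: mean_artist_hotness if x == 0 else x)

# Vectorised version: take the mean where the value is 0, else keep the value
data['artist_hotness'] = np.where(data['artist_hotness'] == 0,
                                  mean_artist_hotness,
                                  data['artist_hotness'])

print(mapped.equals(data['artist_hotness']))
```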

Answer 3 (score: 0)

I found these functions very useful, although mask is really slow (not sure why).

I did this:

# Parentheses are required because | binds more tightly than ==
df.loc[ (df['artist_hotness'] == 0) | np.isnan(df['artist_hotness']), 'artist_hotness' ] = df['artist_hotness'].median()
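One thing to watch in this approach: median() skips NaN but still counts the zeros, which pulls the fill value down. If zeros and NaN both mean "missing", a possible variant is to turn the zeros into NaN first and then fill. This is a sketch under that assumption, with made-up sample values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'artist_hotness': [0, 1, 5, np.nan, 3]})

# Treat zeros as missing so that they do not take part in the median
valid = df['artist_hotness'].replace(0, np.nan)
median_valid = valid.median()  # median of the remaining values 1, 3, 5

df['artist_hotness'] = valid.fillna(median_valid)
print(df)
```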