Replace NaN values with other values from the same column

Time: 2020-06-10 19:17:44

Tags: python pandas dataframe replace nan

My DataFrame looks like this:

id    zip     location
X2    65123   Houston
T5    65123   Houston
A1    nan     Houston
M8    89517   Berkley
X3    89518   Berkley
N2    nan     Berkley
M9    nan     nan

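For some rows, "zip" is missing but "location" has an entry.
I want to fill the NaN values in "zip" with one of the zip codes from the same location. Sometimes there is more than one candidate: for N2, for example, both 89517 and 89518 are possible, and it does not matter which one is chosen. However, rows where both zip and location are NaN should stay unchanged. How can I do this?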

2 answers:

Answer 0 (score: 1)

Since you don't care which value is used, we can take the max value of each location group:

>>> df['zip'] = df.groupby('location')['zip'].transform(lambda x: x.fillna(x.max())).astype(int)
>>> df

   id    zip location
0  X2  65123  Houston
1  T5  65123  Houston
2  A1  65123  Houston
3  M8  89517  Berkley
4  X3  89518  Berkley
5  N2  89518  Berkley
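
For reference, here is a self-contained sketch of the same approach. The frame construction is my own reconstruction of the question's sample data, without the M9 row (whose NaN zip would make the cast to int fail); the names `filled` and `s` are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question, minus the all-NaN row M9.
df = pd.DataFrame({
    "id": ["X2", "T5", "A1", "M8", "X3", "N2"],
    "zip": [65123, 65123, np.nan, 89517, 89518, np.nan],
    "location": ["Houston", "Houston", "Houston",
                 "Berkley", "Berkley", "Berkley"],
})

# Within each location, replace missing zips with that location's largest zip.
filled = df.groupby("location")["zip"].transform(lambda s: s.fillna(s.max()))

# The cast to int only works because no NaN is left after filling.
df["zip"] = filled.astype(int)
print(df)
```

Using `transform` rather than an aggregation keeps one value per original row, so the result can be assigned straight back to the column.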

If you need to handle rows where both zip and location are NaN, first filter them out into a sub-frame:

>>> sub_df = df.loc[df[['zip', 'location']].notna().any(axis=1)]
>>> df
   id      zip location
0  X2  65123.0  Houston
1  T5  65123.0  Houston
2  A1      NaN  Houston
3  M7      NaN      NaN    # <-- added a line in between to show index is maintained
4  M8  89517.0  Berkley
5  X3  89518.0  Berkley
6  N2      NaN  Berkley
7  M9      NaN      NaN

>>> sub_df
   id      zip location
0  X2  65123.0  Houston
1  T5  65123.0  Houston
2  A1      NaN  Houston    # <-- No index 3
4  M8  89517.0  Berkley
5  X3  89518.0  Berkley
6  N2      NaN  Berkley

Then perform the same operation, only this time skip the cast to int, because the frame will still contain NaN values:

df['zip'] = sub_df.groupby('location')['zip'].transform(lambda x: x.fillna(x.max()))

Result:

   id      zip location
0  X2  65123.0  Houston
1  T5  65123.0  Houston
2  A1  65123.0  Houston
3  M7      NaN      NaN
4  M8  89517.0  Berkley
5  X3  89518.0  Berkley
6  N2  89518.0  Berkley
7  M9      NaN      NaN
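
The filter-then-fill steps can be put together as a runnable sketch. The frame below is reconstructed from the printed example above; the final conversion to the nullable `Int64` dtype is my own optional addition, not part of the original answer, for cases where integer zips are wanted even though some rows stay missing.

```python
import numpy as np
import pandas as pd

# Data reconstructed from the printed example, including the all-NaN rows M7 and M9.
df = pd.DataFrame({
    "id": ["X2", "T5", "A1", "M7", "M8", "X3", "N2", "M9"],
    "zip": [65123, 65123, np.nan, np.nan, 89517, 89518, np.nan, np.nan],
    "location": ["Houston", "Houston", "Houston", np.nan,
                 "Berkley", "Berkley", "Berkley", np.nan],
})

# Keep only rows where at least one of zip/location is known.
sub_df = df.loc[df[["zip", "location"]].notna().any(axis=1)]

# Fill per location. Assignment aligns on the index, so the rows that were
# filtered out (index 3 and 7) simply end up as NaN.
df["zip"] = sub_df.groupby("location")["zip"].transform(
    lambda s: s.fillna(s.max())
)

# Optional (an assumption about what you want): a nullable integer dtype
# keeps whole-number zips while still allowing missing entries.
df["zip"] = df["zip"].astype("Int64")
print(df)
```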

Answer 1 (score: 0)

If you don't care which value is filled in, a simple approach is to sort the table by location and zip, then use fillna with method='ffill':

>>> df
       zip location
0  65123.0  Houston
1  65123.0  Houston
2      NaN  Houston
3  89517.0  Berkley
4  89518.0  Berkley
5      NaN  Berkley

>>> df.sort_values(by=['location','zip']).fillna(method='ffill')
       zip location
3  89517.0  Berkley
4  89518.0  Berkley
5  89518.0  Berkley
0  65123.0  Houston
1  65123.0  Houston
2  65123.0  Houston
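
One caveat with a plain forward fill is that it crosses location boundaries: a location with no known zip at all would inherit the last zip of the location sorted before it. Also, recent pandas versions (2.1 and later) deprecate `fillna(method='ffill')` in favour of `.ffill()`. The variant below is my own adjustment, not part of the original answer: it restricts the fill to each group with `GroupBy.ffill`. "Austin" is a hypothetical location added only to show the boundary case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "zip": [65123, 65123, np.nan, 89517, 89518, np.nan, np.nan],
    # "Austin" is a made-up location with no known zip at all.
    "location": ["Houston", "Houston", "Houston",
                 "Berkley", "Berkley", "Berkley", "Austin"],
})

# Sort so that, within each location, known zips come before the NaN ones
# (sort_values places NaN last by default).
ordered = df.sort_values(by=["location", "zip"])

# Forward-fill inside each location only, so nothing leaks between groups.
ordered["zip"] = ordered.groupby("location")["zip"].ffill()

# Restore the original row order.
result = ordered.sort_index()
print(result)
```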

Update: the solution below also handles NaN in location. Group by location first, then fill with the max within each group.

>>> df
       zip location
0  65123.0  Houston
1  65123.0  Houston
2      NaN  Houston
3  89517.0  Berkley
4  89518.0  Berkley
5      NaN  Berkley
6      NaN      NaN

>>> # transform (rather than apply) keeps the original row index on current pandas
>>> df['zip'] = df.groupby('location')['zip'].transform(lambda x: x.fillna(x.max()))
>>> df
       zip location
0  65123.0  Houston
1  65123.0  Houston
2  65123.0  Houston
3  89517.0  Berkley
4  89518.0  Berkley
5  89518.0  Berkley
6      NaN      NaN
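
Since any zip from the same location is acceptable, the lambda can also be avoided. The sketch below is an alternative of my own, not taken from the answer: it broadcasts each location's first non-null zip with the built-in `"first"` aggregation and uses it only where zip is missing. The name `candidate` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "zip": [65123, 65123, np.nan, 89517, 89518, np.nan, np.nan],
    "location": ["Houston", "Houston", "Houston",
                 "Berkley", "Berkley", "Berkley", np.nan],
})

# First non-null zip of each location, repeated for every row of that location.
# Rows whose location is NaN belong to no group and get NaN here.
candidate = df.groupby("location")["zip"].transform("first")

# Only fill where zip is missing; existing zips are left untouched.
df["zip"] = df["zip"].fillna(candidate)
print(df)
```

With this variant N2 receives 89517 instead of 89518, which the question says is equally acceptable.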