Fill missing values in column A conditioned on the values in column B

Time: 2019-07-18 07:36:48

Tags: python pandas data-manipulation

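I want to fill in the missing values in the "Height" column using the average height for the "Sport" each athlete competes in, and I want to do it without a for loop.

The data I'm using comes from this Kaggle data set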

import pandas as pd
import numpy as np

df = pd.read_csv("athlete_events.csv")

ID                      Name Sex   Age  Height  Weight            Team  \
0   1                 A Dijiang   M  24.0   180.0    80.0           China   
1   2                  A Lamusi   M  23.0   170.0    60.0           China   
2   3       Gunnar Nielsen Aaby   M  24.0     NaN     NaN         Denmark   
3   4      Edgar Lindenau Aabye   M  34.0     NaN     NaN  Denmark/Sweden   
4   5  Christine Jacoba Aaftink   F  21.0   185.0    82.0     Netherlands   

   NOC        Games  Year  Season       City          Sport  \
0  CHN  1992 Summer  1992  Summer  Barcelona     Basketball   
1  CHN  2012 Summer  2012  Summer     London           Judo   
2  DEN  1920 Summer  1920  Summer  Antwerpen       Football   
3  DEN  1900 Summer  1900  Summer      Paris     Tug-Of-War   
4  NED  1988 Winter  1988  Winter    Calgary  Speed Skating   

                              Event Medal  
0       Basketball Men's Basketball   NaN  
1      Judo Men's Extra-Lightweight   NaN  
2           Football Men's Football   NaN  
3       Tug-Of-War Men's Tug-Of-War  Gold  
4  Speed Skating Women's 500 metres   NaN  
...

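I created a data frame that gives the average height (and weight) for each sport. Where a sport has no recorded values at all, the missing average is replaced by the mean of that column.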

y = df.groupby("Sport")[["Height", "Weight"]].mean()
y = y.fillna(y.mean())  # y must exist before it can be used as its own fill value

                      Height     Weight
Sport                                  
Aeronautics       175.610332  71.873264
Alpine Skiing     173.489052  72.068110
Alpinism          175.610332  71.873264
Archery           173.203085  70.011135
Art Competitions  174.644068  75.290909
...

I tried using the fillna function

df.Height.fillna()

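but I don't know how to make the code look at the sport a particular athlete competes in and then look up the corresponding height in the y data frame.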
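One loop-free way this lookup might be done is sketched below on a small made-up frame. The toy data and the names `toy` and `sport_means` are assumptions for illustration and are not taken from the Kaggle file. `Series.map` translates each row's Sport into that sport's mean height, and `fillna` accepts the resulting Series, filling only the missing positions by index alignment.

```python
import numpy as np
import pandas as pd

# Toy stand-in for athlete_events.csv (hypothetical values).
toy = pd.DataFrame({
    "Sport": ["Judo", "Judo", "Judo", "Football", "Football", "Tug-Of-War"],
    "Height": [170.0, 174.0, np.nan, 180.0, np.nan, np.nan],
})

# Per-sport mean height. A sport with no recorded heights gets NaN here,
# so it is replaced by the mean of the other sports' means.
sport_means = toy.groupby("Sport")["Height"].mean()
sport_means = sport_means.fillna(sport_means.mean())

# Look up each row's sport mean, then fill only the missing heights.
toy["Height"] = toy["Height"].fillna(toy["Sport"].map(sport_means))
print(toy)
```

With the real data, the same pattern would presumably read `df["Height"].fillna(df["Sport"].map(y["Height"]))`, since `y` is indexed by Sport.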

I also tried the apply function, but I ran into the same problem: I don't know how to make it look up the height based on the athlete's sport.
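As an alternative to apply, a group-wise `transform` may be enough and needs no separate lookup table. The sketch below uses invented data, and the names `sample`, `per_row_mean` and `filled` are hypothetical. It also falls back to the overall mean of the recorded heights for sports with no heights at all, which is an assumption and differs slightly from the mean-of-sport-means used for y above.

```python
import numpy as np
import pandas as pd

# Invented sample rows (not from the Kaggle file).
sample = pd.DataFrame({
    "Sport": ["Judo", "Judo", "Judo", "Football", "Football", "Tug-Of-War"],
    "Height": [170.0, 174.0, np.nan, 184.0, np.nan, np.nan],
})

# transform("mean") returns one value per row: the mean height of that row's sport.
per_row_mean = sample.groupby("Sport")["Height"].transform("mean")

# Fill missing heights with the sport mean; existing heights are left untouched.
filled = sample["Height"].fillna(per_row_mean)

# Sports with no recorded heights are still NaN; fall back to the overall mean.
filled = filled.fillna(sample["Height"].mean())
print(filled)
```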

0 answers:

No answers