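I want to fill in the missing values in the "Height" column using the mean height of the "Sport" each athlete competes in, and I want to do it without a for loop.
The data I'm working with comes from this Kaggle data set.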
import pandas as pd
import numpy as np
df = pd.read_csv("athlete_events.csv")
ID Name Sex Age Height Weight Team \
0 1 A Dijiang M 24.0 180.0 80.0 China
1 2 A Lamusi M 23.0 170.0 60.0 China
2 3 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark
3 4 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden
4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands
NOC Games Year Season City Sport \
0 CHN 1992 Summer 1992 Summer Barcelona Basketball
1 CHN 2012 Summer 2012 Summer London Judo
2 DEN 1920 Summer 1920 Summer Antwerpen Football
3 DEN 1900 Summer 1900 Summer Paris Tug-Of-War
4 NED 1988 Winter 1988 Winter Calgary Speed Skating
Event Medal
0 Basketball Men's Basketball NaN
1 Judo Men's Extra-Lightweight NaN
2 Football Men's Football NaN
3 Tug-Of-War Men's Tug-Of-War Gold
4 Speed Skating Women's 500 metres NaN
...
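I created a data frame holding the mean height (and weight) for each sport. Missing values in that data frame (sports with no recorded heights or weights) have been replaced with the mean of each column: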
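y = df.groupby("Sport")[["Height", "Weight"]].mean()  # select the numeric columns before aggregating
y = y.fillna(y.mean())  # y must exist before it can be used in fillna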
Height Weight
Sport
Aeronautics 175.610332 71.873264
Alpine Skiing 173.489052 72.068110
Alpinism 175.610332 71.873264
Archery 173.203085 70.011135
Art Competitions 174.644068 75.290909
...
I tried using the fillna function:
df.Height.fillna()
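but I can't work out how to make the code look at the sport a particular athlete competes in and then look up the height for that sport in the y data frame.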
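To make the lookup described above concrete, here is a minimal sketch of one way it could be done, using `Series.map` to translate each row's sport into the matching mean from `y`. The small data frame is a made-up stand-in for `athlete_events.csv`, so the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for athlete_events.csv (values invented for illustration)
df = pd.DataFrame({
    "Sport": ["Judo", "Judo", "Football", "Football", "Tug-Of-War"],
    "Height": [170.0, np.nan, 180.0, np.nan, np.nan],
})

# Per-sport mean height; a sport with no recorded heights falls back
# to the mean of the per-sport means, as in the y data frame above
y = df.groupby("Sport")[["Height"]].mean()
y = y.fillna(y.mean())

# Map each row's sport to its mean height, then use that value
# only where Height is missing (fillna aligns on the row index)
df["Height"] = df["Height"].fillna(df["Sport"].map(y["Height"]))
```

Because `df["Sport"].map(y["Height"])` has the same index as `df`, `fillna` only touches the rows that were `NaN` and leaves the recorded heights alone.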
I also tried the apply function, but I ran into the same problem: I can't work out how to make it look up the height based on the athlete's sport.
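As a possible alternative to `apply`, `groupby(...).transform("mean")` returns the group mean broadcast back onto every row, so no separate lookup table is needed. This is only a sketch on made-up data; note that the fallback used here (the overall mean of the recorded heights) is an assumption and is not quite the same as the mean of the per-sport means used in `y`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for athlete_events.csv (values invented for illustration)
df = pd.DataFrame({
    "Sport": ["Judo", "Judo", "Football", "Football", "Tug-Of-War"],
    "Height": [170.0, np.nan, 180.0, np.nan, np.nan],
})

# Mean height of each row's sport, aligned with df's index;
# NaN for sports where no height was recorded at all
sport_mean = df.groupby("Sport")["Height"].transform("mean")

# Overall mean of the recorded heights, used as a fallback (assumption)
overall_mean = df["Height"].mean()

# Fill from the sport mean first, then from the overall mean
df["Height"] = df["Height"].fillna(sport_mean).fillna(overall_mean)
```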