I am currently wrangling a large dataset of 2 million rows from Lyft for Lyda's Udacity project. The DataFrame looks like this:
      id                      name   latitude    longitude
0  148.0      Horton St at 40th St  37.829705  -122.287610
1  376.0    Illinois St at 20th St  37.760458  -122.387540
2  453.0      Brannan St at 4th St  37.777934  -122.396973
3  182.0  19th Street BART Station  37.809369  -122.267951
4  237.0    Fruitvale BART Station  37.775232  -122.224498
5    NaN                       NaN  37.775232  -122.224498
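As I tried to show in the last row, I have many NaN values for id and name, but latitude and longitude are never empty. My assumption is that, given a particular combination of latitude and longitude, I can actually recover name from other rows.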
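Before filling anything, it may be worth checking that assumption on the data. The following is a minimal sketch on a toy frame mirroring the sample above, not the real dataset; the names `names_per_coord`, `ambiguous` and `unfillable` are made up for illustration. It counts how many distinct names each coordinate pair carries.

```python
import numpy as np
import pandas as pd

# Toy data mirroring the sample above (not the real 2-million-row dataset).
df = pd.DataFrame({
    "id": [148.0, 376.0, 453.0, 182.0, 237.0, np.nan],
    "name": ["Horton St at 40th St", "Illinois St at 20th St",
             "Brannan St at 4th St", "19th Street BART Station",
             "Fruitvale BART Station", np.nan],
    "latitude": [37.829705, 37.760458, 37.777934,
                 37.809369, 37.775232, 37.775232],
    "longitude": [-122.287610, -122.387540, -122.396973,
                  -122.267951, -122.224498, -122.224498],
})

# Count distinct non-NaN names per coordinate pair (nunique skips NaN by default).
names_per_coord = df.groupby(["latitude", "longitude"])["name"].nunique()

# Pairs with more than one name are ambiguous; pairs with zero names
# cannot be filled from other rows at all.
ambiguous = names_per_coord[names_per_coord > 1]
unfillable = names_per_coord[names_per_coord == 0]
print(len(ambiguous), len(unfillable))
```

If `ambiguous` is non-empty on the real data, a coordinate pair does not identify a single station and any fill based on it would need a tie-breaking rule.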
Once I have the name, I would fill the NaN values of id from name:
dict_id = dict(zip(df['name'], df['id']))
df['id'] = df['id'].fillna(df['name'].map(dict_id))
However, I'm struggling, because with latitude and longitude I have two values to match against the name.
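For what it's worth, the dictionary approach can in principle be keyed on both values at once by using a (latitude, longitude) tuple as the key. This is only a sketch on toy data, assuming each coordinate pair belongs to exactly one station; `known`, `coord_to_name`, `keys` and `name_to_id` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy data shaped like the sample above.
df = pd.DataFrame({
    "id": [237.0, 148.0, np.nan],
    "name": ["Fruitvale BART Station", "Horton St at 40th St", np.nan],
    "latitude": [37.775232, 37.829705, 37.775232],
    "longitude": [-122.224498, -122.287610, -122.224498],
})

# Build the mapping only from rows where the name is known.
known = df.dropna(subset=["name"])
coord_to_name = dict(zip(zip(known["latitude"], known["longitude"]),
                         known["name"]))

# One (latitude, longitude) tuple per row, used as the lookup key.
keys = pd.Series(list(zip(df["latitude"], df["longitude"])), index=df.index)

# Passing dict.get as a function looks each tuple up as a whole key;
# unknown coordinates yield None, which fillna leaves as missing.
df["name"] = df["name"].fillna(keys.map(coord_to_name.get))

# With names filled, the name -> id step applies as before.
name_to_id = dict(zip(known["name"], known["id"]))
df["id"] = df["id"].fillna(df["name"].map(name_to_id))
print(df)
```

Exact float equality is relied on here, which should hold when the coordinates come from the same source column, but rounding the coordinates first may be safer if they were computed or parsed differently.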
Answer 0 (score: 4)
You can merge the DataFrame with a copy of itself taken after dropna, then rename the columns:
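# Look up id/name by coordinates from rows where name is known (assumes each
# coordinate pair occurs once there; repeats would multiply rows).
m = df.merge(df.dropna(subset=['name']), on=['latitude', 'longitude'],
             how='left', suffixes=('', '_y'))
# Replace the original id/name with the looked-up columns.
out = (m.drop(columns=['id', 'name'])
        .rename(columns={'id_y': 'id', 'name_y': 'name'})
        .reindex(df.columns, axis=1))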
      id                      name   latitude    longitude
0  148.0      Horton St at 40th St  37.829705  -122.287610
1  376.0    Illinois St at 20th St  37.760458  -122.387540
2  453.0      Brannan St at 4th St  37.777934  -122.396973
3  182.0  19th Street BART Station  37.809369  -122.267951
4  237.0    Fruitvale BART Station  37.775232  -122.224498
5  237.0    Fruitvale BART Station  37.775232  -122.224498
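One caveat for the full dataset: with 2 million rows, each station presumably appears many times, and a left merge against a lookup that repeats a coordinate pair returns one output row per match. A minimal sketch of a duplicate-safe variant follows, assuming each coordinate pair maps to a single station. The toy frame and the names `lookup` and `filled` are illustrative; `validate='many_to_one'` asks pandas to raise if the lookup still contains a repeated pair.

```python
import numpy as np
import pandas as pd

# Toy frame: the Fruitvale station appears twice with a name and once without.
df = pd.DataFrame({
    "id": [148.0, 237.0, 237.0, np.nan],
    "name": ["Horton St at 40th St", "Fruitvale BART Station",
             "Fruitvale BART Station", np.nan],
    "latitude": [37.829705, 37.775232, 37.775232, 37.775232],
    "longitude": [-122.287610, -122.224498, -122.224498, -122.224498],
})

# One row per coordinate pair, taken from rows where the name is known.
lookup = (df.dropna(subset=["name"])
            .drop_duplicates(subset=["latitude", "longitude"]))

# validate raises MergeError if a coordinate pair is still duplicated in lookup.
m = df.merge(lookup, on=["latitude", "longitude"], how="left",
             suffixes=("", "_y"), validate="many_to_one")

# Keep existing values and fill only the gaps from the looked-up columns.
# The merge preserves the row order of df, so positional assignment is safe.
filled = df.copy()
filled["id"] = m["id"].fillna(m["id_y"]).to_numpy()
filled["name"] = m["name"].fillna(m["name_y"]).to_numpy()
print(filled)
```

Unlike the rename-based version, this keeps whatever id and name a row already had and only fills the missing ones, which may or may not be what you want if the existing values are inconsistent.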