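I have a DataFrame with a column whose data depends on the value in another column. Unfortunately, the source I collect the data from only provides the value for the second column ('job_id') the first time a given value of the first column ('host_id') appears. As a result, my 'job_id' column has a lot of NaN values.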
In [1]: import pandas as pd, numpy as np
In [2]: df = pd.DataFrame({'run_id' : range(10),
...: 'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
...: 'job_id': [100253, 100254, 100255, 100256, 100257, np.nan, np.nan, np.nan, np.nan, np.nan]})
In [3]: df
Out[3]:
host_id job_id run_id
0 a 100253.0 0
1 b 100254.0 1
2 c 100255.0 2
3 d 100256.0 3
4 e 100257.0 4
5 a NaN 5
6 d NaN 6
7 c NaN 7
8 a NaN 8
9 e NaN 9
The desired output has 'job_id' repeated in the same way as 'host_id':
host_id job_id run_id
0 a 100253.0 0
1 b 100254.0 1
2 c 100255.0 2
3 d 100256.0 3
4 e 100257.0 4
5 a 100253.0 5
6 d 100256.0 6
7 c 100255.0 7
8 a 100253.0 8
9 e 100257.0 9
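The solution I came up with is to extract just the 'host_id' and 'job_id' columns, drop the rows containing NaN, left-merge the result onto the original DataFrame, and then rename and reorder the resulting columns.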
In [3]: host_job_mapping = df[['host_id', 'job_id']].dropna(subset=['job_id'])
In [4]: host_job_mapping
Out[4]:
host_id job_id
0 a 100253.0
1 b 100254.0
2 c 100255.0
3 d 100256.0
4 e 100257.0
In [5]: df = pd.merge(df, host_job_mapping, how='left', on='host_id')
In [6]: df
Out[6]:
host_id job_id_x run_id job_id_y
0 a 100253.0 0 100253.0
1 b 100254.0 1 100254.0
2 c 100255.0 2 100255.0
3 d 100256.0 3 100256.0
4 e 100257.0 4 100257.0
5 a NaN 5 100253.0
6 d NaN 6 100256.0
7 c NaN 7 100255.0
8 a NaN 8 100253.0
9 e NaN 9 100257.0
In [7]: df = df.rename(columns={'job_id_y': 'job_id'})[['host_id', 'job_id', 'run_id']]
In [8]: df
Out[8]:
host_id job_id run_id
0 a 100253.0 0
1 b 100254.0 1
2 c 100255.0 2
3 d 100256.0 3
4 e 100257.0 4
5 a 100253.0 5
6 d 100256.0 6
7 c 100255.0 7
8 a 100253.0 8
9 e 100257.0 9
While this works, it doesn't seem particularly elegant. Is there a simpler or more direct way to achieve this (without resorting to apply)?
Answer 0 (score: 1)
You can group by host_id and then do a forward fill:
df.groupby('host_id', as_index=False).ffill()
# host_id job_id run_id
#0 a 100253.0 0
#1 b 100254.0 1
#2 c 100255.0 2
#3 d 100256.0 3
#4 e 100257.0 4
#5 a 100253.0 5
#6 d 100256.0 6
#7 c 100255.0 7
#8 a 100253.0 8
#9 e 100257.0 9
If other columns may also contain missing values that should be left alone, forward fill only job_id:
df['job_id'] = df.job_id.groupby(df.host_id).ffill()
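As a self-contained sketch of the column-wise forward fill, using the DataFrame from the question: the name `filled` is only illustrative. One caveat, stated as an assumption about versions rather than a guarantee: in more recent pandas releases the whole-frame `df.groupby('host_id').ffill()` may leave the grouping column out of its result, so filling the single column is the more portable form. Either way, a forward fill only works here because the known job_id appears before the missing rows for each host.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'run_id': range(10),
    'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
    'job_id': [100253, 100254, 100255, 100256, 100257] + [np.nan] * 5,
})

# Forward fill job_id within each host_id group; the result keeps the
# original index, so it can be assigned straight back as a column.
filled = df.assign(job_id=df.groupby('host_id')['job_id'].ffill())
```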
Or, following your original approach, first get the relationship between host_id and job_id, then use map to look up the job_id for each host_id:
df.job_id = df.host_id.map(df.set_index('host_id').job_id.dropna())
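The same idea as a self-contained sketch, with the lookup built as a separate step. The `drop_duplicates` call is a defensive addition that is not in the original answer: mapping through a Series generally needs unique index labels, so this assumes that keeping the first known job_id per host is acceptable if a host ever reports more than one. The name `lookup` is hypothetical.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'run_id': range(10),
    'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
    'job_id': [100253, 100254, 100255, 100256, 100257] + [np.nan] * 5,
})

# Build a host_id -> job_id lookup from the rows where job_id is known.
# Assumption: keep only the first known job_id for each host_id.
lookup = (
    df.dropna(subset=['job_id'])
      .drop_duplicates(subset='host_id')
      .set_index('host_id')['job_id']
)

# Replace job_id by looking up each row's host_id in the mapping.
df['job_id'] = df['host_id'].map(lookup)
```

Unlike the forward fill, this does not depend on row order: a host's known job_id is applied to every row for that host, including any that come before it.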