Is there a simple DataFrame method to copy values in a column row-wise, depending on values in another column in another row?

Time: 2017-07-26 00:06:44

Tags: python pandas dataframe

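I have a DataFrame with a column whose data depends on the value in another column. Unfortunately, the source I collect the data from only gives the value of the first column ('job_id') the first time a given value of the second column ('host_id') appears. As a result, my 'job_id' column has a lot of NaN values.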

In [1]: import pandas as pd, numpy as np

In [2]: df = pd.DataFrame({'run_id' : range(10),
   ...:                    'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
   ...:                    'job_id': [100253, 100254, 100255, 100256, 100257, np.nan, np.nan, np.nan, np.nan, np.nan]})

In [3]: df
Out[3]: 
  host_id    job_id  run_id
0       a  100253.0       0
1       b  100254.0       1
2       c  100255.0       2
3       d  100256.0       3
4       e  100257.0       4
5       a       NaN       5
6       d       NaN       6
7       c       NaN       7
8       a       NaN       8
9       e       NaN       9

The desired output has 'job_id' repeating in the same pattern as 'host_id':

  host_id    job_id  run_id
0       a  100253.0       0
1       b  100254.0       1
2       c  100255.0       2
3       d  100256.0       3
4       e  100257.0       4
5       a  100253.0       5
6       d  100256.0       6
7       c  100255.0       7
8       a  100253.0       8
9       e  100257.0       9

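The solution I came up with is to extract just the 'host_id' and 'job_id' columns, drop the rows containing NaN, left-merge the result onto the original DataFrame, and then rename and reorder the resulting columns.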

In [3]: host_job_mapping = df[['host_id', 'job_id']].dropna(subset=['job_id'])

In [4]: host_job_mapping
Out[4]: 
  host_id    job_id
0       a  100253.0
1       b  100254.0
2       c  100255.0
3       d  100256.0
4       e  100257.0

In [5]: df = pd.merge(df, host_job_mapping, how='left', on='host_id')

In [6]: df
Out[6]: 
  host_id  job_id_x  run_id  job_id_y
0       a  100253.0       0  100253.0
1       b  100254.0       1  100254.0
2       c  100255.0       2  100255.0
3       d  100256.0       3  100256.0
4       e  100257.0       4  100257.0
5       a       NaN       5  100253.0
6       d       NaN       6  100256.0
7       c       NaN       7  100255.0
8       a       NaN       8  100253.0
9       e       NaN       9  100257.0

In [7]: df = df.rename(columns={'job_id_y': 'job_id'})[['host_id', 'job_id', 'run_id']]

In [8]: df
Out[8]: 
  host_id    job_id  run_id
0       a  100253.0       0
1       b  100254.0       1
2       c  100255.0       2
3       d  100256.0       3
4       e  100257.0       4
5       a  100253.0       5
6       d  100256.0       6
7       c  100255.0       7
8       a  100253.0       8
9       e  100257.0       9

While this works, it doesn't seem particularly elegant. Is there a simpler or more direct way to achieve this (without using apply)?

1 answer:

Answer 0: (score: 1)

You can group by host_id and then do a forward fill:

df.groupby('host_id', as_index=False).ffill()

#  host_id    job_id    run_id
#0       a  100253.0    0
#1       b  100254.0    1
#2       c  100255.0    2
#3       d  100256.0    3
#4       e  100257.0    4
#5       a  100253.0    5
#6       d  100256.0    6
#7       c  100255.0    7
#8       a  100253.0    8
#9       e  100257.0    9
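
A caveat on the call above: the output shown comes from the pandas version current in 2017. As far as I know, more recent releases leave the grouping column ('host_id') out of the result of a grouped ffill. The following is a minimal sketch that should behave the same either way, assuming the question's data; the names value_cols and result are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'run_id': range(10),
    'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
    'job_id': [100253, 100254, 100255, 100256, 100257] + [np.nan] * 5,
})

# Forward fill the non-key columns within each host_id group.
# The result keeps the original row index, so it aligns with df.
filled = df.groupby('host_id').ffill()

# Copy only the value columns back, so it does not matter whether
# this pandas version includes 'host_id' in `filled`.
value_cols = ['job_id', 'run_id']
result = df.copy()
result[value_cols] = filled[value_cols]

print(result)
```

Working on a copy keeps the original frame untouched, which makes it easy to compare before and after.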

If other columns may also contain missing values that should be left alone, forward fill only job_id:

df['job_id'] = df.job_id.groupby(df.host_id).ffill()
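
Note that a forward fill only helps when the known job_id is on the first row of each host, as in the question. The sketch below uses hypothetical data (not from the question) in which a host appears once before its job_id is known. It compares the grouped forward fill with transform('first'), a different technique that broadcasts each group's first non-null value to all of its rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: host 'a' shows up once before its job_id is known.
df = pd.DataFrame({
    'host_id': ['a', 'a', 'b', 'a'],
    'job_id': [np.nan, 100253, 100254, np.nan],
})

# Forward fill within each host: row 0 stays NaN because nothing precedes it.
ffilled = df['job_id'].groupby(df['host_id']).ffill()

# Broadcast each host's first non-null job_id to every row of that host.
first_known = df.groupby('host_id')['job_id'].transform('first')

print(pd.DataFrame({'ffill': ffilled, 'first': first_known}))
```

If every host's job_id is always given on its first row, the two approaches should agree and the forward fill is enough.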

Or, following your original approach, first get the relation between host_id and job_id, then use map to look up job_id from host_id:

df.job_id = df.host_id.map(df.set_index('host_id').job_id.dropna())
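
The one-liner above relies on each host having exactly one non-NaN job_id, so that the lookup Series ends up with a unique index. Here is a self-contained sketch of the same idea, assuming the question's data, with the lookup built in explicit steps. The name mapping and the drop_duplicates safeguard are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'run_id': range(10),
    'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
    'job_id': [100253, 100254, 100255, 100256, 100257] + [np.nan] * 5,
})

# Build a host_id -> job_id lookup from the rows where job_id is known.
# drop_duplicates keeps one row per host so the lookup index stays unique.
mapping = (
    df.dropna(subset=['job_id'])
      .drop_duplicates(subset='host_id')
      .set_index('host_id')['job_id']
)

# Look up every row's job_id from its host_id.
df['job_id'] = df['host_id'].map(mapping)

print(df)
```

Unlike the merge in the question, this leaves no job_id_x / job_id_y columns to rename or reorder afterwards.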