How to fill a column based on several other columns?

Time: 2020-07-28 12:12:55

Tags: python pandas dataframe

I have two dataframes like this:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        'A': list('aaabdcde'),
        'B': list('smnipiuy'),
        'C': list('zzzqqwll')
    }
)

df2 = pd.DataFrame(
    {
        'mapcol': list('abpppozl')
    }
)

   A  B  C
0  a  s  z
1  a  m  z
2  a  n  z
3  b  i  q
4  d  p  q
5  c  i  w
6  d  u  l
7  e  y  l

  mapcol
0      a
1      b
2      p
3      p
4      p
5      o
6      z
7      l

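Now I want to create an additional column in df1 that is filled with values from columns A, B and C, depending on whether those values can be found in df2['mapcol']. If a row has values that can be found in more than one column, they should be taken from A first, then B, then C, so my expected result looks like this: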

   A  B  C final
0  a  s  z     a  # <- values can be found in A and C, but A is preferred
1  a  m  z     a  # <- values can be found in A and C, but A is preferred
2  a  n  z     a  # <- values can be found in A and C, but A is preferred
3  b  i  q     b  # <- value can be found in A 
4  d  p  q     p  # <- value can be found in B
5  c  i  w   NaN  # none of the values can be mapped
6  d  u  l     l  # value can be found in C
7  e  y  l     l  # value can be found in C

A simple implementation could look like this (repeatedly filling the final column with fillna, in the preferred order):

preferred_order = ['A', 'B', 'C']

df1['final'] = np.nan

for col in preferred_order:
    df1['final'] = df1['final'].fillna(df1[col][df1[col].isin(df2['mapcol'])])

which gives the desired result.

Does anyone see a solution that avoids the loop?

2 answers:

Answer 0: (score: 5)

Use:

order =  ['A', 'B', 'C'] # order of columns

d = df1[order].isin(df2['mapcol'].tolist()).loc[lambda x: x.any(axis=1)].idxmax(axis=1)
# DataFrame.lookup requires pandas < 2.0 (deprecated in 1.2, removed in 2.0)
df1.loc[d.index, 'final'] = df1.lookup(d.index, d)

Details:

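Use DataFrame.isin to build a boolean mask, keep only the rows with at least one match using DataFrame.any along axis=1, then use DataFrame.idxmax along axis=1 to get the name of the column holding the maximum value in each row. For a boolean mask, that is the first True in the given column order: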

print(d)
0    A
1    A
2    A
3    A
4    B
6    C
7    C
dtype: object
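To make the chained expression easier to follow, here is a minimal sketch that splits it into its three steps. The intermediate names mask and matched are introduced here for illustration only and are not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        'A': list('aaabdcde'),
        'B': list('smnipiuy'),
        'C': list('zzzqqwll'),
    }
)
df2 = pd.DataFrame({'mapcol': list('abpppozl')})

order = ['A', 'B', 'C']  # order of columns

# Step 1: True wherever a cell's value occurs in df2['mapcol']
# ("mask" is an illustrative name, not from the original answer)
mask = df1[order].isin(df2['mapcol'].tolist())

# Step 2: keep only rows with at least one match (row 5 drops out)
# ("matched" is an illustrative name as well)
matched = mask[mask.any(axis=1)]

# Step 3: idxmax returns the first column holding the row maximum,
# which for booleans is the first True in the preferred order
d = matched.idxmax(axis=1)
print(d)
```

Because idxmax reports the first occurrence of the maximum, putting the columns in order before calling it is what encodes the A, then B, then C preference.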

Use DataFrame.lookup to look up the values in df1 corresponding to the index and columns of d, and assign these values to the column final.
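On pandas versions where DataFrame.lookup is not available, the same lookup can be sketched with positional NumPy indexing. This is an assumed substitute for the call in the answer, not the answerer's own code; the names rows and cols are hypothetical helpers.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        'A': list('aaabdcde'),
        'B': list('smnipiuy'),
        'C': list('zzzqqwll'),
    }
)
df2 = pd.DataFrame({'mapcol': list('abpppozl')})

order = ['A', 'B', 'C']  # order of columns

d = df1[order].isin(df2['mapcol'].tolist()).loc[lambda x: x.any(axis=1)].idxmax(axis=1)

# Translate the row labels and column names in d into integer positions
# ("rows" and "cols" are illustrative names)
rows = df1.index.get_indexer(d.index)
cols = df1.columns.get_indexer(d.to_numpy())

# Pick one cell per matched row, then align back to df1 by index;
# rows that are missing from d (here row 5) end up as NaN
df1['final'] = pd.Series(df1.to_numpy()[rows, cols], index=d.index)
print(df1)
```

Assigning a Series relies on index alignment, so unmatched rows become NaN without a separate initialisation step.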

Answer 1: (score: 5)

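You can use where with isin on the full dataframe df1 to mask the values that do not occur in df2, then select the columns in preferred_order, bfill along the columns, and keep the first column with iloc: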

preferred_order = ['A', 'B', 'C']

df1['final'] = (df1.where(df1.isin(df2['mapcol'].to_numpy()))
                   [preferred_order]
                   .bfill(axis=1)
                   .iloc[:, 0]
               )
print (df1)
   A  B  C final
0  a  s  z     a
1  a  m  z     a
2  a  n  z     a
3  b  i  q     b
4  d  p  q     p
5  c  i  w   NaN
6  d  u  l     l
7  e  y  l     l
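As a rough illustration of why the backfill works, the chain can be split into its intermediate frames. The names masked and filled are introduced here for illustration and do not appear in the original answer.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        'A': list('aaabdcde'),
        'B': list('smnipiuy'),
        'C': list('zzzqqwll'),
    }
)
df2 = pd.DataFrame({'mapcol': list('abpppozl')})

preferred_order = ['A', 'B', 'C']

# Keep only cells whose value occurs in df2['mapcol']; all others become NaN
# ("masked" is an illustrative name)
masked = df1.where(df1.isin(df2['mapcol'].to_numpy()))[preferred_order]

# Backfill along each row: a NaN takes the next non-NaN value to its right,
# so the first column ends up holding the first match in preferred order
# ("filled" is an illustrative name)
filled = masked.bfill(axis=1)

# The first column is the result; a row with no match stays NaN
final = filled.iloc[:, 0]
print(masked)
print(filled)
```

For row 4, for example, masked holds NaN, p, NaN, and after the backfill the first column holds p. Row 5 has no match at all and stays NaN throughout.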