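How can I check which values of one DataFrame column are present in the columns of four other DataFrames?

Asked: 2017-01-30 06:46:51

Tags: python pandas numpy dataframe

I have a base DataFrame as follows: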

import pandas as pd

df1_data = {'id' :{0:'101',1:'102',2:'103',3:'104',4:'105'},
        'sym1' :{0:'abc',1:'pqr',2:'xyz',3:'mno',4:'lmn'}}
df1 = pd.DataFrame(df1_data)
print(df1)

    id sym1
0  101  abc
1  102  pqr
2  103  xyz
3  104  mno
4  105  lmn

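From this DataFrame, I want to check whether the values in column sym1 are present in a column of four other DataFrames.

The four other DataFrames: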

df2_data = {'sym2' :{0:'abc',1:'xxx',2:'xyz',3:'mno'},
        'name' :{0:'a',1:'b',2:'c',3:'d'}}
df2 = pd.DataFrame(df2_data)
print(df2)

df3_data = {'sym2' :{0:'abc',1:'xxx',2:'xyz',3:'mno'},
            'name' :{0:'h',1:'i',2:'k',3:'l'}}
# Note: df3 is built from df2_data, not df3_data (probably a typo);
# the output shown in the answers reflects this.
df3 = pd.DataFrame(df2_data)
print(df3)

df4_data = {'sym2' :{0:'abc',1:'xxx',2:'xyz',3:'mno'},
            'name' :{0:'p',1:'q',2:'r',3:'s'}}
df4 = pd.DataFrame(df4_data)
print(df4)

df5_data = {'sym2' :{0:'abc',1:'xxx',2:'xyz',3:'mno'},
            'name' :{0:'w',1:'x',2:'y',3:'z'}}
df5 = pd.DataFrame(df5_data)
print(df5)

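The sym2 columns of df2, df3, df4, and df5 may or may not contain the same symbols. My goal is to check whether the sym1 values appear among the sym2 values of df2, df3, df4, and df5.

Expected output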

    id sym1
0  102  pqr
1  105  lmn

Conclusion:

The symbols pqr and lmn are not present in the sym2 column of df2, df3, df4, or df5.

2 answers:

Answer 0: (score: 5)

  • Use isin to check whether each element of df1.sym1 is contained in another iterable
  • Use pd.concat to stack all the other DataFrames together
df1[~df1.sym1.isin(pd.concat([df2, df3, df4, df5]).sym2)]

    id sym1
1  102  pqr
4  105  lmn
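
For readers who want to run this end to end, here is a minimal self-contained sketch of the same isin + pd.concat idea. The names others, all_sym2, found, and missing are illustrative and not part of the original answer, and the four other frames are reduced to their sym2 column on the assumption that only that column matters for the check.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['101', '102', '103', '104', '105'],
                    'sym1': ['abc', 'pqr', 'xyz', 'mno', 'lmn']})

# Stand-ins for df2..df5: as in the question, all four share the same sym2 values.
others = [pd.DataFrame({'sym2': ['abc', 'xxx', 'xyz', 'mno']}) for _ in range(4)]

# Pool every sym2 value into a single Series.
all_sym2 = pd.concat([d['sym2'] for d in others], ignore_index=True)

# True where a sym1 value occurs somewhere in the pooled sym2 values.
found = df1['sym1'].isin(all_sym2)

# Negate the mask to keep only the symbols that were not found anywhere.
missing = df1[~found]
print(missing)
```

Because the boolean mask is applied to df1 directly, the result keeps df1's original index labels; calling reset_index(drop=True) on it would renumber the rows from 0, as in the expected output of the question.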

A numpy variant, reported to be about 3 times faster

df1[~df1.sym1.isin(np.concatenate([d.sym2.values for d in [df2, df3, df4, df5]]))]
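
The sketch below puts the pandas and numpy routes side by side and checks that they produce the same mask. It is an illustration under assumptions, not a benchmark: the names others, pooled, mask_pd, and mask_np are made up here, and any speedup should be measured on your own data (for example with timeit).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': ['101', '102', '103', '104', '105'],
                    'sym1': ['abc', 'pqr', 'xyz', 'mno', 'lmn']})

# Stand-ins for df2..df5, reduced to the sym2 column.
others = [pd.DataFrame({'sym2': ['abc', 'xxx', 'xyz', 'mno']}) for _ in range(4)]

# pandas route: concatenate the frames, then select sym2.
mask_pd = df1['sym1'].isin(pd.concat(others)['sym2'])

# numpy route: concatenate the raw arrays, so no index has to be built.
pooled = np.concatenate([d['sym2'].to_numpy() for d in others])
mask_np = df1['sym1'].isin(pooled)

# Both routes should flag exactly the same rows.
assert mask_pd.equals(mask_np)
print(df1[~mask_np])
```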

Answer 1: (score: 4)

Another solution, which uses merge with the indicator parameter for the comparison:

dfs = [df2,df3,df4,df5]
df = pd.concat(dfs, keys=['df2','df3','df4','df5'])
print (df)
      name sym2
df2 0    a  abc
    1    b  xxx
    2    c  xyz
    3    d  mno
df3 0    a  abc
    1    b  xxx
    2    c  xyz
    3    d  mno
df4 0    p  abc
    1    q  xxx
    2    r  xyz
    3    s  mno
df5 0    w  abc
    1    x  xxx
    2    y  xyz
    3    z  mno
merged = pd.merge(df.rename_axis(['dfs','idx']).reset_index(), 
                  df1, 
                  left_on='sym2', 
                  right_on='sym1', 
                  how='outer', 
                  indicator=True)
print (merged)
    dfs  idx name sym2   id sym1      _merge
0   df2  0.0    a  abc  101  abc        both
1   df3  0.0    a  abc  101  abc        both
2   df4  0.0    p  abc  101  abc        both
3   df5  0.0    w  abc  101  abc        both
4   df2  1.0    b  xxx  NaN  NaN   left_only
5   df3  1.0    b  xxx  NaN  NaN   left_only
6   df4  1.0    q  xxx  NaN  NaN   left_only
7   df5  1.0    x  xxx  NaN  NaN   left_only
8   df2  2.0    c  xyz  103  xyz        both
9   df3  2.0    c  xyz  103  xyz        both
10  df4  2.0    r  xyz  103  xyz        both
11  df5  2.0    y  xyz  103  xyz        both
12  df2  3.0    d  mno  104  mno        both
13  df3  3.0    d  mno  104  mno        both
14  df4  3.0    s  mno  104  mno        both
15  df5  3.0    z  mno  104  mno        both
16  NaN  NaN  NaN  NaN  102  pqr  right_only
17  NaN  NaN  NaN  NaN  105  lmn  right_only
print (merged.loc[merged['_merge']=='right_only', ['id','sym1']])
     id sym1
16  102  pqr
17  105  lmn

print (merged.loc[merged['_merge']=='left_only', ['dfs', 'sym2']])
   dfs sym2
4  df2  xxx
5  df3  xxx
6  df4  xxx
7  df5  xxx
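
If only the missing sym1 values are needed, a left merge against the de-duplicated sym2 values may be enough. This is a sketch of a variation on the approach above, not the answerer's code; the names others, pooled, and missing are hypothetical, and the frames are reduced to the sym2 column for brevity.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['101', '102', '103', '104', '105'],
                    'sym1': ['abc', 'pqr', 'xyz', 'mno', 'lmn']})

# Stand-ins for df2..df5, keyed by name.
others = {name: pd.DataFrame({'sym2': ['abc', 'xxx', 'xyz', 'mno']})
          for name in ['df2', 'df3', 'df4', 'df5']}

# Unique pooled sym2 values, renamed so the merge key matches df1.
pooled = (pd.concat(list(others.values()), ignore_index=True)[['sym2']]
            .drop_duplicates()
            .rename(columns={'sym2': 'sym1'}))

# A left merge keeps every df1 row; _merge marks rows with no match as 'left_only'.
merged = df1.merge(pooled, on='sym1', how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', ['id', 'sym1']]
print(missing)
```

Dropping duplicates before the merge keeps one row per df1 entry, which avoids the four-fold repetition seen in the outer merge, at the cost of losing the information about which DataFrame each symbol came from.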