Check for matches between two dataframes, regardless of row order, with pandas

Time: 2019-03-04 10:02:09

Tags: python pandas

I have two dataframes, for example:

>>> df1

query   target     
A:1     AZ     
B:4     AZ  
C:5     AZ    
D:1     AZ  

>>> df2

query   target
B:6     AZ
C:5     AZ
D:1     AZ
A:1     AZ

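The idea is simply to check whether the values in df2['query'] are present in df1['query'], whatever the order of the rows, and to add a new column to df1 so that I get: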

>>> df1

query   target    new_col 
A:1     AZ        present
B:4     AZ        Not_present
C:5     AZ        present
D:1     AZ        present

I tried:

df1["new_col"] = df2.apply(lambda row: "present" if row[0] == df1["query"][row.name] else "Not_present", axis = 1)

but this only checks for matches row by row.

Thanks for your help.

Edit

What if I have to compare 3 dataframes against df1?

Here is the new example:

df1 

query
A1
A2
B3
B5
B6
B7
C8
C9

df2

query target
C9    type2
Z6    type2

df3
query target
C10   type3
B6    type3

df4
query target
A1    type4
K9    type1

I would write a loop like this:

for df in dataframes: 
   df1['new_col'] = np.where(df1['query'].isin(df['query']), 'Present', 'Not_absent')

The problem is that the df1['new_col'] column gets overwritten on every iteration.
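The overwrite can be reproduced with a minimal sketch. The frame contents are copied from the example above; the list name `dataframes` and the label strings are assumptions. Because np.where returns a full-length array, each pass replaces the entire column, so only the matches against the last frame survive.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"query": ["A1", "A2", "B3", "B5", "B6", "B7", "C8", "C9"]})
df2 = pd.DataFrame({"query": ["C9", "Z6"], "target": ["type2", "type2"]})
df3 = pd.DataFrame({"query": ["C10", "B6"], "target": ["type3", "type3"]})
df4 = pd.DataFrame({"query": ["A1", "K9"], "target": ["type4", "type1"]})

dataframes = [df2, df3, df4]  # assumed name for the list being looped over

for df in dataframes:
    # np.where builds a value for every row, so the whole column is replaced
    df1["new_col"] = np.where(df1["query"].isin(df["query"]), "present", "not_present")

# Only the match against the last frame (df4) is still marked as present
surviving = df1.loc[df1["new_col"] == "present", "query"].tolist()
print(surviving)
```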

In the end I should get:

df1 

    query   new_col
    A1      present_type4
    A2      not_present
    B3      not_present
    B5      not_present
    B6      present_type3
    B7      not_present
    C8      not_present
    C9      present_type2

Edit for jezrael:

To open the dataframes, I have a file.txt file that looks like this:

Species1
Species2
Species3

These names are used to build the right path to where each dataframe is stored, for example:

/admin/user/project/Species1/dataframe.txt etc

So I open them to create the df like this:

for i in open("file.txt"):
    # strip the trailing newline before building the path
    df = open("/admin/user/project/" + i.strip() + "/dataframe.txt", "r")

Then, as described above, I have to find the matches between all of these dataframes and one big dataframe (df1).

By doing this:

# Read the species names, one per line
with open("file.txt") as fh:
    keys = [line.strip() for line in fh if line.strip()]

# Build the path to each species' dataframe
values = []
for name in keys:
    values.append("/admin/user/project/" + name + "/dataframe.txt")

# Map each species name to its path
d = {}
for k, v in zip(keys, values):
    d[k] = v

I managed to get what you showed me, but the values are the paths used to open the dataframes:

>>> d
{'Species1': '/admin/user/project/Species1/dataframe.txt', 'Species2': '/admin/user/project/Species2/dataframe.txt', 'Species3': '/admin/user/project/Species3/dataframe.txt'}

3 answers:

Answer 0: (score: 2)

Use numpy.where with Series.isin:

df1['new_col'] = np.where(df1['query'].isin(df2['query']), 'present', 'Not_present')
print (df1)
  query target      new_col
0   A:1     AZ      present
1   B:4     AZ  Not_present
2   C:5     AZ      present
3   D:1     AZ      present
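
For a self-contained check, the two frames from the question can be rebuilt by hand. This is only a sketch with the example data typed in; the names `mask` and `shuffled` are illustrative. Series.isin tests membership in the set of values, so the order of the rows in df2 does not affect the result.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"query": ["A:1", "B:4", "C:5", "D:1"], "target": ["AZ"] * 4})
df2 = pd.DataFrame({"query": ["B:6", "C:5", "D:1", "A:1"], "target": ["AZ"] * 4})

# True where the df1 value occurs anywhere in df2['query']
mask = df1["query"].isin(df2["query"])
df1["new_col"] = np.where(mask, "present", "Not_present")

# Shuffling df2 gives the same mask, since only membership is tested
shuffled = df2.sample(frac=1, random_state=0)
same_after_shuffle = mask.equals(df1["query"].isin(shuffled["query"]))

print(df1)
```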

Edit:

d = {'type2':df2, 'type3':df3, 'type4':df4}
df1['new_col'] = 'not_present'
for k, v in d.items(): 
   df1.loc[df1['query'].isin(v['query']), 'new_col'] = 'Present_{}'.format(k)

print (df1)
  query        new_col
0    A1  Present_type4
1    A2    not_present
2    B3    not_present
3    B5    not_present
4    B6  Present_type3
5    B7    not_present
6    C8    not_present
7    C9  Present_type2
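
In the loop above, a query that occurs in more than one frame ends up with the label of the last frame in the dict. A loop-free variant is sketched below under that same rule; it is an alternative, not the answer's method, and the names `pairs`, `mapping` and `label` are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"query": ["A1", "A2", "B3", "B5", "B6", "B7", "C8", "C9"]})
df2 = pd.DataFrame({"query": ["C9", "Z6"], "target": ["type2", "type2"]})
df3 = pd.DataFrame({"query": ["C10", "B6"], "target": ["type3", "type3"]})
df4 = pd.DataFrame({"query": ["A1", "K9"], "target": ["type4", "type1"]})
d = {"type2": df2, "type3": df3, "type4": df4}

# One row per (query, label) pair, stacked in dict order
pairs = pd.concat(
    [v[["query"]].assign(label="Present_" + k) for k, v in d.items()],
    ignore_index=True,
)
# keep="last" mirrors the loop: a later frame overrides an earlier one
pairs = pairs.drop_duplicates(subset="query", keep="last")

# Look up each df1 query; values with no match become NaN, then the default
mapping = pairs.set_index("query")["label"]
df1["new_col"] = df1["query"].map(mapping).fillna("not_present")

print(df1)
```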

Edit: You can create each DataFrame inside the loop and pass it to isin:

d = {'Species1': '/admin/user/project/Species1/dataframe.txt', 'Species2': '/admin/user/project/Species2/dataframe.txt', 'Species3': '/admin/user/project/Species3/dataframe.txt'}

df1['new_col'] = 'not_present'
for k, v in d.items(): 
    df = pd.read_csv(v)
    df1.loc[df1['query'].isin(df['query']), 'new_col'] = 'Present_{}'.format(k)
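
Note that pd.read_csv expects comma-separated data by default. If the dataframe.txt files are whitespace-delimited like the tables printed in the question (an assumption, since the real file layout is not shown), a `sep` argument would be needed. The sketch below uses an in-memory buffer as a stand-in for one of the files.

```python
import io

import pandas as pd

# Stand-in for the contents of one dataframe.txt (assumed whitespace-delimited)
text = "query target\nC9 type2\nZ6 type2\n"

# sep=r"\s+" splits on any run of whitespace instead of commas
df = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(df.columns.tolist())
print(df["query"].tolist())
```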

Answer 1: (score: 1)

A solution using df.loc[]:

df1.loc[df1['query'].isin(df2['query']),'new_col']='present'
df1.new_col=df1.new_col.fillna('Not_present')
print(df1)

  query target      new_col
0   A:1     AZ      present
1   B:4     AZ  Not_present
2   C:5     AZ      present
3   D:1     AZ      present
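
The fillna step is needed because assigning through .loc to a column that does not exist yet creates it with NaN in every row the mask leaves out. A small sketch with the question's data (the name `missing_before` is illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"query": ["A:1", "B:4", "C:5", "D:1"], "target": ["AZ"] * 4})
df2 = pd.DataFrame({"query": ["B:6", "C:5", "D:1", "A:1"], "target": ["AZ"] * 4})

# Creates new_col; rows not selected by the mask are left as NaN
df1.loc[df1["query"].isin(df2["query"]), "new_col"] = "present"
missing_before = df1["new_col"].isna().tolist()

# Fill the unmatched rows with the default label
df1["new_col"] = df1["new_col"].fillna("Not_present")

print(missing_before)
print(df1)
```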

Answer 2: (score: 0)

Another solution, using pd.merge:
df_temp = df2.copy()
df_temp['new_col'] = 'present'
# select both columns with a list of names
df_temp = df_temp[['query', 'new_col']]
df1 = df1.merge(df_temp, how='left', on='query').fillna('Not_present')
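
A related variant, sketched here as an alternative rather than as part of this answer, uses the `indicator` parameter of merge so that no helper column has to be added to df2. The names `merged` and `result` are illustrative, and drop_duplicates is added on the assumption that df2 may contain repeated queries, which would otherwise duplicate rows of df1.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"query": ["A:1", "B:4", "C:5", "D:1"], "target": ["AZ"] * 4})
df2 = pd.DataFrame({"query": ["B:6", "C:5", "D:1", "A:1"], "target": ["AZ"] * 4})

# indicator=True adds a _merge column: 'both' for matches, 'left_only' otherwise
merged = df1.merge(
    df2[["query"]].drop_duplicates(), on="query", how="left", indicator=True
)
merged["new_col"] = np.where(merged["_merge"] == "both", "present", "Not_present")
result = merged.drop(columns="_merge")

print(result)
```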