I have two dataframes, such as:
>>> df1
query target
A:1 AZ
B:4 AZ
C:5 AZ
D:1 AZ
>>> df2
query target
B:6 AZ
C:5 AZ
D:1 AZ
A:1 AZ
The idea is simply to check whether each value in df1['query'] is present in df2['query'], regardless of row order, then add a new column to df1 and get:
>>> df1
query target new_col
A:1 AZ present
B:4 AZ Not_present
C:5 AZ present
D:1 AZ present
I tried: df1["new_col"] = df2.apply(lambda row: "present" if row[0] == df1["query"][row.name] else "Not_present", axis = 1)
but this only compares the values row by row, so a match on a different row is missed.
Thanks for your help.
Edit
What if I have to compare 3 dataframes against df1?
Here is a new example:
df1
query
A1
A2
B3
B5
B6
B7
C8
C9
df2
query target
C9 type2
Z6 type2
df3
query target
C10 type3
B6 type3
df4
query target
A1 type4
K9 type1
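I would loop like this:
for df in dataframes:
    df1['new_col'] = np.where(df1['query'].isin(df['query']), 'Present', 'Not_present')
The problem is that every iteration overwrites the column df1['new_col'], so only the result for the last dataframe survives. In the end I should get: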
df1
query new_col
A1 present_type4
A2 not_present
B3 not_present
B5 not_present
B6 present_type3
B7 not_present
C8 not_present
C9 present_type2
For jezrael
Edit:
To open the dataframes, I have a file.txt file, such as:
Species1
Species2
Species3
Each name is used to build the right path to where the corresponding dataframe is stored, for example:
/admin/user/project/Species1/dataframe.txt etc
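So I use the names to create each df, for example:
with open("file.txt") as names_file:
    for i in names_file:
        # strip the trailing newline before building the path
        df = open("/admin/user/project/" + i.strip() + "/dataframe.txt", "r")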
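Then, as described above, I have to find the matches between all these dataframes and one big dataframe (df1).
I did this:
keys = []
values = []
with open("file.txt") as names_file:
    for names in names_file:
        names = names.strip()  # drop the trailing newline
        keys.append(names)
        values.append("/admin/user/project/" + names + "/dataframe.txt")
# direct mapping from species name to path
dicts = {}
for k, v in zip(keys, values):
    dicts[k] = v
# same result, built step by step
d = {}
for i in range(len(keys)):
    d[i] = None
for i in range(len(keys)):
    d[keys[i]] = d.pop(i)
for (k, v), i in zip(d.items(), values):
    d[k] = i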
I managed to get what you showed me,
but the values are the paths used to open the dataframes:
>>> d
{'Species1': '/admin/user/project/Species1/dataframe.txt', 'Species2': '/admin/user/project/Species2/dataframe.txt', 'Species3': '/admin/user/project/Species3/dataframe.txt'}
Answer 0 (score: 2)
Use numpy.where with Series.isin:
df1['new_col'] = np.where(df1['query'].isin(df2['query']), 'present', 'Not_present')
print (df1)
query target new_col
0 A:1 AZ present
1 B:4 AZ Not_present
2 C:5 AZ present
3 D:1 AZ present
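For readers who want to run the snippet above end to end, here is a minimal self-contained sketch. The two frames are rebuilt by hand from the tables in the question, and the intermediate name mask is introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({"query": ["A:1", "B:4", "C:5", "D:1"], "target": ["AZ"] * 4})
df2 = pd.DataFrame({"query": ["B:6", "C:5", "D:1", "A:1"], "target": ["AZ"] * 4})

# isin tests membership value by value, so the row order of df2 does not matter.
mask = df1["query"].isin(df2["query"])

# np.where picks the first label where the mask is True, the second otherwise.
df1["new_col"] = np.where(mask, "present", "Not_present")
print(df1)
```

Because the mask is built from membership rather than from index alignment, "A:1" is found even though it sits in the last row of df2.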
Edit:
d = {'type2':df2, 'type3':df3, 'type4':df4}
df1['new_col'] = 'not_present'
for k, v in d.items():
    df1.loc[df1['query'].isin(v['query']), 'new_col'] = 'Present_{}'.format(k)
print (df1)
query new_col
0 A1 Present_type4
1 A2 not_present
2 B3 not_present
3 B5 not_present
4 B6 Present_type3
5 B7 not_present
6 C8 not_present
7 C9 Present_type2
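The loop above can be tried on the sample data from the question with a sketch like the following. The frames are typed in by hand, and the names frames, label and matched are illustrative rather than part of the original answer.

```python
import pandas as pd

# Sample frames copied from the question's second example.
df1 = pd.DataFrame({"query": ["A1", "A2", "B3", "B5", "B6", "B7", "C8", "C9"]})
df2 = pd.DataFrame({"query": ["C9", "Z6"], "target": ["type2", "type2"]})
df3 = pd.DataFrame({"query": ["C10", "B6"], "target": ["type3", "type3"]})
df4 = pd.DataFrame({"query": ["A1", "K9"], "target": ["type4", "type1"]})

frames = {"type2": df2, "type3": df3, "type4": df4}

# Start from the default label, then overwrite only the rows that match.
df1["new_col"] = "not_present"
for label, frame in frames.items():
    matched = df1["query"].isin(frame["query"])
    df1.loc[matched, "new_col"] = "Present_{}".format(label)

print(df1)
```

Since each pass assigns only to the matched rows, earlier results are kept. One thing to be aware of: if the same query appeared in more than one frame, the frame processed last would determine the label.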
Edit: You can create each DataFrame inside the loop and pass it to isin:
d = {'Species1': '/admin/user/project/Species1/dataframe.txt', 'Species2': '/admin/user/project/Species2/dataframe.txt', 'Species3': '/admin/user/project/Species3/dataframe.txt'}
df1['new_col'] = 'not_present'
for k, v in d.items():
    df = pd.read_csv(v)
    df1.loc[df1['query'].isin(df['query']), 'new_col'] = 'Present_{}'.format(k)
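The whole pipeline, from the list of species names to the labelled column, might look like the sketch below. Several things here are assumptions and not taken from the thread: the files are written to a temporary directory standing in for /admin/user/project, each dataframe.txt is assumed to be tab-separated with a query header, and the sample values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for /admin/user/project, so the sketch runs anywhere.
base = tempfile.mkdtemp()
sample = {"Species1": ["C9", "Z6"], "Species2": ["C10", "B6"], "Species3": ["A1", "K9"]}
for name, queries in sample.items():
    os.makedirs(os.path.join(base, name))
    pd.DataFrame({"query": queries}).to_csv(
        os.path.join(base, name, "dataframe.txt"), sep="\t", index=False
    )
with open(os.path.join(base, "file.txt"), "w") as handle:
    handle.write("\n".join(sample) + "\n")

# Read the species names (skipping blank lines) and map each one to its path.
with open(os.path.join(base, "file.txt")) as handle:
    names = [line.strip() for line in handle if line.strip()]
paths = {name: os.path.join(base, name, "dataframe.txt") for name in names}

df1 = pd.DataFrame({"query": ["A1", "A2", "B6", "C9"]})
df1["new_col"] = "not_present"
for name, path in paths.items():
    # sep="\t" is an assumption about the file format; adjust it to the real files.
    df = pd.read_csv(path, sep="\t")
    df1.loc[df1["query"].isin(df["query"]), "new_col"] = "Present_{}".format(name)

print(df1)
```

A dict comprehension replaces the manual key and value bookkeeping, and reading each file inside the loop means only one species table is held in memory at a time.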
Answer 1 (score: 1)
A solution using df.loc[]:
df1.loc[df1['query'].isin(df2['query']),'new_col']='present'
df1.new_col=df1.new_col.fillna('Not_present')
print(df1)
query target new_col
0 A:1 AZ present
1 B:4 AZ Not_present
2 C:5 AZ present
3 D:1 AZ present
Answer 2 (score: 0)
Use pd.merge:
df_temp = df2.copy()
df_temp['new_col'] = 'present'
# select both columns with a list of names
df_temp = df_temp[['query', 'new_col']]
df1 = df1.merge(df_temp, how='left', on='query').fillna('Not_present')
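A variant of the merge approach, offered as a sketch and not as the answerer's method, uses the indicator=True option of DataFrame.merge. It avoids filling unrelated missing values with fillna, and dropping duplicate keys first keeps the left merge from multiplying rows of df1. The names keys and merged are illustrative.

```python
import numpy as np
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({"query": ["A:1", "B:4", "C:5", "D:1"], "target": ["AZ"] * 4})
df2 = pd.DataFrame({"query": ["B:6", "C:5", "D:1", "A:1"], "target": ["AZ"] * 4})

# Keep only unique keys so the left merge cannot duplicate rows of df1.
keys = df2[["query"]].drop_duplicates()

# indicator=True adds a "_merge" column: "both" for matches, "left_only" otherwise.
merged = df1.merge(keys, how="left", on="query", indicator=True)
merged["new_col"] = np.where(merged["_merge"] == "both", "present", "Not_present")
merged = merged.drop(columns="_merge")

print(merged)
```

For a single membership check, isin is usually simpler. The merge form becomes more attractive when other columns from df2 need to be carried over at the same time.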