I have a requirement where I must do an exact match between two columns of two data frames.
df[res_name] = df[plain_col] == df[b_col]
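Now I would like to add "contains" logic to it.
For example, if the df[b_col] value is found inside df[plain_col], return True, otherwise return False.
Example:
With df[b_col] holding the value 1A and df[plain_col] holding the value 1A12, the output should be True.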
Answer 0 (score: 1)
I think you need a list comprehension with zip and in to process the columns row by row:
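import pandas as pd

df = pd.DataFrame({'plain_col':['1A12','1C12','1B12'],
                   'b_col':['1A','1B','1C']})
# True where the b_col string occurs inside the plain_col string of the same row
df['res_name'] = [b in a for a, b in zip(df['plain_col'], df['b_col'])]
print(df)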
  plain_col b_col  res_name
0      1A12    1A      True
1      1C12    1B     False
2      1B12    1C     False
Performance:
df = pd.DataFrame({'plain_col':['1A12','1C12','1B12'],
'b_col':['1A','1B','1C']})
#3k rows
df = pd.concat([df] * 1000, ignore_index=True)
In [15]: %timeit df['res_name'] = [b in a for a, b in zip(df['plain_col'], df['b_col'])]
605 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: %timeit df['res_name'] = df.apply(lambda row:row.b_col in row.plain_col, axis=1)
75.2 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
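The %timeit magic above only works in IPython. Outside of it, a rough equivalent can be sketched with the standard timeit module. The helper names with_zip and with_apply are made up for this illustration, and absolute timings will differ by machine and pandas version:

```python
import timeit

import pandas as pd

df = pd.DataFrame({'plain_col': ['1A12', '1C12', '1B12'],
                   'b_col': ['1A', '1B', '1C']})
# 3k rows, as in the benchmark above
df = pd.concat([df] * 1000, ignore_index=True)


def with_zip(frame):
    # list comprehension over the two columns (hypothetical helper)
    return [b in a for a, b in zip(frame['plain_col'], frame['b_col'])]


def with_apply(frame):
    # row-wise apply (hypothetical helper)
    return frame.apply(lambda row: row.b_col in row.plain_col, axis=1).tolist()


# both approaches should produce the same booleans
assert with_zip(df) == with_apply(df)

runs = 10
t_zip = timeit.timeit(lambda: with_zip(df), number=runs)
t_apply = timeit.timeit(lambda: with_apply(df), number=runs)
print(f'zip:   {t_zip / runs * 1e3:.2f} ms per loop')
print(f'apply: {t_apply / runs * 1e3:.2f} ms per loop')
```

The equality assertion is there so that the two timings are known to measure the same result.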
EDIT:
The error "argument of type 'float' is not iterable" clearly indicates missing values (NaN is a float); a possible fix is:
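import numpy as np
import pandas as pd

df = pd.DataFrame({'plain_col':['1A12','1C12',np.nan],
                   'b_col':['1A','1B','1C']})
def func(a, b):
    # NaN is the only value not equal to itself, so this detects missing values
    if (a != a) or (b != b):
        return False
    return b in a
df['res_name'] = list(map(func, df['plain_col'], df['b_col']))
print(df)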
  plain_col b_col  res_name
0      1A12    1A      True
1      1C12    1B     False
2       NaN    1C     False
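The a != a test in func works because NaN compares unequal to itself. A small sketch of that behaviour, together with an equivalent check written with pd.isna, may make this easier to read. The name func_isna is made up here and is not part of the answer; note also that pd.isna treats None as missing, which the a != a test does not:

```python
import numpy as np
import pandas as pd

# NaN is not equal to itself, so `x != x` is True for NaN and False for strings
print(np.nan != np.nan)   # True
print('1A12' != '1A12')   # False


def func_isna(a, b):
    # same idea as func, written with pd.isna (hypothetical variant)
    if pd.isna(a) or pd.isna(b):
        return False
    return b in a


df = pd.DataFrame({'plain_col': ['1A12', '1C12', np.nan],
                   'b_col': ['1A', '1B', '1C']})
print(list(map(func_isna, df['plain_col'], df['b_col'])))  # [True, False, False]
```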
Another, more general solution:
import numpy as np
import pandas as pd

df = pd.DataFrame({'plain_col':['1A12',6.7,np.nan],
                   'b_col':['1A','1B','1C']})
def func(a, b):
    # any value that does not support `in` (float, NaN, ...) yields False
    try:
        return b in a
    except Exception:
        return False
df['res_name'] = list(map(func, df['plain_col'], df['b_col']))
print(df)
  plain_col b_col  res_name
0      1A12    1A      True
1       6.7    1B     False
2       NaN    1C     False
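Catching every Exception can also hide unrelated bugs. If the only goal is to skip values that are not strings, an explicit type check is one possible alternative. This is only a sketch, and func_str is a hypothetical name not taken from the answer:

```python
import numpy as np
import pandas as pd


def func_str(a, b):
    # only test containment when both values are strings (hypothetical variant)
    return isinstance(a, str) and isinstance(b, str) and b in a


df = pd.DataFrame({'plain_col': ['1A12', 6.7, np.nan],
                   'b_col': ['1A', '1B', '1C']})
df['res_name'] = list(map(func_str, df['plain_col'], df['b_col']))
print(df['res_name'].tolist())  # [True, False, False]
```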
Answer 1 (score: 1)
You could use apply with a lambda:
df['res_name'] = df.apply(lambda row:row.b_col in row.plain_col, axis=1)