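Is there a way to merge two dataframes if a column in the left dataframe matches any of the columns in the right dataframe? In SQL it would be: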
SELECT
t1.*, t2.*
FROM
t1
JOIN
t2 ON t1.c1 = t2.c1 OR
t1.c1 = t2.c2 OR
t1.c1 = t2.c3 OR
t1.c1 = t2.c4
Python (something like):
import pandas as pd
dataA = [(1), (2)]
pdA = pd.DataFrame(dataA)
pdA.columns = ['col']
dataB = [(1, None), (None, 2), (1, 2)]
pdB = pd.DataFrame(dataB)
pdB.columns = ['col1', 'col2']
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two merges instead
pd.concat([pdA.merge(pdB, left_on='col', right_on='col1'),
           pdA.merge(pdB, left_on='col', right_on='col2')])
Answer 0 (score: 0)
It looks like we're doing a row-by-row isin check. I like to use set logic and numpy broadcasting to help.
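import numpy as np
# turn each row into the set of its non-null values
f = lambda x: set(x.dropna())
npB = pdB.apply(f, 1).values
npA = pdA.apply(f, 1).values
# a[i, j] is True when row j of pdA is a subset of row i of pdB
a = npA <= npB[:, None]
m, n = a.shape
# row positions in pdA and pdB for every (i, j) pair, in ravel order
rA = np.tile(np.arange(n), m)
rB = np.repeat(np.arange(m), n)
a_ = a.ravel()
# stitch the matching rows side by side
pd.DataFrame(
    np.hstack([pdA.values[rA[a_]], pdB.values[rB[a_]]]),
    columns=pdA.columns.tolist() + pdB.columns.tolist()
)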
   col  col1  col2
0  1.0   1.0   NaN
1  2.0   NaN   2.0
2  1.0   1.0   2.0
3  2.0   1.0   2.0
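To spell out what the broadcast comparison is computing, here is a slower but more explicit sketch of the same row-by-row isin check. It assumes the pdA and pdB frames from the question (rebuilt here so the block runs on its own); the names rows, hits and result are just illustrative.

```python
import pandas as pd

# Same data as in the question, rebuilt so the snippet is self-contained
pdA = pd.DataFrame({"col": [1, 2]})
pdB = pd.DataFrame([(1, None), (None, 2), (1, 2)], columns=["col1", "col2"])

rows = []
for _, b_row in pdB.iterrows():
    # rows of pdA whose value appears among the non-null values of this pdB row
    hits = pdA[pdA["col"].isin(b_row.dropna())]
    for _, a_row in hits.iterrows():
        # glue the matching pdA row and the pdB row together
        rows.append({**a_row.to_dict(), **b_row.to_dict()})

result = pd.DataFrame(rows, columns=[*pdA.columns, *pdB.columns])
print(result)
```

It finds the same four pairs, but it loops in Python, so expect it to be much slower than the broadcast version on large frames.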
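Answer 1 (score: 0)
Unfortunately, I don't think there is a built-in way to do this. pandas joins are quite restricted: you can basically only test a left column for equality with a right column, whereas SQL is more general.
It is doable, though, by forming the cross product and then checking all the relevant conditions. As a result it eats some memory (the intermediate frame has len(left) * len(right) rows), but it shouldn't be too inefficient.
Note that I changed your test cases slightly to make them more general, and renamed the variables to something more intuitive.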
import pandas as pd
from functools import reduce
dataA = [1, 2]
dfA = pd.DataFrame(dataA)
dfA.columns = ['col']
dataB = [(1, None, 1), (None, 2, None), (1, 2, None)]
dfB = pd.DataFrame(dataB)
dfB.columns = ['col1', 'col2', 'col3']
print(dfA)
print(dfB)
def cross(left, right):
    """Returns the cross product of the two dataframes, keeping the index of the left"""
    # create dummy columns on the dataframes that will always match in the merge
    left["_"] = 0
    right["_"] = 0
    # merge, keeping the left index, and dropping the dummy column
    result = left.reset_index().merge(right, on="_").set_index("index").drop("_", axis=1)
    # drop the dummy columns from the mutated dataframes
    left.drop("_", axis=1, inplace=True)
    right.drop("_", axis=1, inplace=True)
    return result
def merge_left_in_right(left_df, right_df):
    """Return the join of the two dataframes where the element of the left dataframe's column
    is in one of the right dataframe's columns"""
    left_col, right_cols = left_df.columns[0], right_df.columns
    result = cross(left_df, right_df)  # form the cross product with a view to filtering it
    # a row must satisfy one of the following conditions:
    tests = (result[left_col] == result[right_col] for right_col in right_cols)
    # form the disjunction of the conditions
    left_in_right = reduce(lambda left_bools, right_bools: left_bools | right_bools, tests)
    # return the appropriate rows
    return result[left_in_right]
print(merge_left_in_right(dfA, dfB))
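On pandas 1.2 or later, the dummy-column trick in cross can probably be replaced with merge(how="cross"). The following is a minimal sketch under that assumption, not a drop-in replacement for the functions above: it resets the index rather than keeping the left one, and it assumes the left and right column names don't collide. The names pairs, matches and result are made up for the example.

```python
import pandas as pd

dfA = pd.DataFrame({"col": [1, 2]})
dfB = pd.DataFrame(
    [(1, None, 1), (None, 2, None), (1, 2, None)],
    columns=["col1", "col2", "col3"],
)

# Cross join: every row of dfA paired with every row of dfB (needs pandas >= 1.2)
pairs = dfA.merge(dfB, how="cross")

# Compare the left column against each right column, row by row;
# NaN never compares equal, so missing values cannot produce a match
matches = pairs[dfB.columns].eq(pairs["col"], axis=0).any(axis=1)

result = pairs[matches].reset_index(drop=True)
print(result)
```

The memory cost is the same as before, since the full len(dfA) * len(dfB) cross product is still materialised before filtering.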