Question

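Is there a way to merge two data frames when a column of the left data frame matches any of the columns of the right data frame? In SQL it would be: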

SELECT
  t1.*, t2.*
FROM
  t1
JOIN
  t2 ON t1.c1 = t2.c1 OR 
        t1.c1 = t2.c2 OR 
        t1.c1 = t2.c3 OR 
        t1.c1 = t2.c4

In Python (something like):

import pandas as pd

dataA = [(1), (2)]

pdA = pd.DataFrame(dataA)
pdA.columns = ['col']

dataB = [(1, None), (None, 2), (1, 2)]

pdB = pd.DataFrame(dataB)
pdB.columns = ['col1', 'col2']

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two merges instead
pd.concat([
    pdA.merge(pdB, left_on='col', right_on='col1'),
    pdA.merge(pdB, left_on='col', right_on='col2'),
])

Answer 1

It looks like we're doing a row-by-row isin check. I like to use set logic, with numpy broadcasting to help.

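import numpy as np

# turn each row into the set of its non-null values
f = lambda x: set(x.dropna())
npB = pdB.apply(f, axis=1).values
npA = pdA.apply(f, axis=1).values

# a[i, j] is True when row j of pdA is a subset of row i of pdB
a = npA <= npB[:, None]
m, n = a.shape

# row positions in pdA and pdB for every (i, j) pair, in ravel order
rA = np.tile(np.arange(n), m)
rB = np.repeat(np.arange(m), n)

a_ = a.ravel()

# keep the matching pairs and put the two sides next to each other
pd.DataFrame(
    np.hstack([pdA.values[rA[a_]], pdB.values[rB[a_]]]),
    columns=pdA.columns.tolist() + pdB.columns.tolist()
)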

   col  col1  col2
0  1.0   1.0   NaN
1  2.0   NaN   2.0
2  1.0   1.0   2.0
3  2.0   1.0   2.0
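
The comparison `npA <= npB[:, None]` works because `<=` between Python sets means "is a subset of". As a rough sketch of what that mask holds, the same table can be built with an explicit loop. The names `sets_a`, `sets_b` and `subset_mask` are made up for this illustration, and the sets are written out by hand rather than taken from the data frames.

```python
import numpy as np

# Row-wise sets of non-null values, written out by hand for the example data.
sets_a = [{1}, {2}]            # rows of pdA
sets_b = [{1}, {2}, {1, 2}]    # rows of pdB

# subset_mask[i, j] is True when row j of A is a subset of row i of B.
subset_mask = np.array([[a <= b for a in sets_a] for b in sets_b])
print(subset_mask)
```

Each True entry corresponds to one row of the result above: the row index picks the row of pdB and the column index picks the row of pdA.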

Answer 2

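Unfortunately, I don't think there is a built-in way to do this. pandas joins are quite restricted: essentially you can only test a left column for equality with a right column, whereas SQL is more general.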

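It can be done, though, by forming the cross product and then checking all the relevant conditions. That takes some extra memory, but it shouldn't be too inefficient.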

Note that I changed your test cases slightly to make them more general, and renamed the variables to something more intuitive.

import pandas as pd
from functools import reduce

dataA = [1, 2]

dfA = pd.DataFrame(dataA)
dfA.columns = ['col']

dataB = [(1, None, 1), (None, 2, None), (1, 2, None)]

dfB = pd.DataFrame(dataB)
dfB.columns = ['col1', 'col2', 'col3']

print(dfA)
print(dfB)


def cross(left, right):
    """Returns the cross product of the two dataframes, keeping the index of the left"""

    # create dummy columns on the dataframes that will always match in the merge
    left["_"] = 0
    right["_"] = 0

    # merge, keeping the left index, and dropping the dummy column
    result = left.reset_index().merge(right, on="_").set_index("index").drop("_", axis=1)

    # drop the dummy columns from the mutated dataframes
    left.drop("_", axis=1, inplace=True)
    right.drop("_", axis=1, inplace=True)
    return result


def merge_left_in_right(left_df, right_df):
    """Return the join of the two dataframes where the element of the left dataframe's column
    is in one of the right dataframe's columns"""

    left_col, right_cols = left_df.columns[0], right_df.columns

    result = cross(left_df, right_df)    # form the cross product with a view to filtering it

    # a row must satisfy one of the following conditions:
    tests = (result[left_col] == result[right_col] for right_col in right_cols)

    # form the disjunction of the conditions
    left_in_right = reduce(lambda left_bools, right_bools: left_bools | right_bools, tests)

    # return the appropriate rows
    return result[left_in_right]


print(merge_left_in_right(dfA, dfB))
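
For what it's worth, newer pandas versions (1.2 and later) accept `how="cross"` in `merge`, so the dummy-column trick may not be needed. The following is only a sketch of the same cross-then-filter idea under that assumption, not the code from this answer; the names `pairs`, `matches` and `result` are chosen just for the example, and unlike `cross` above it does not keep the left index.

```python
import pandas as pd

dfA = pd.DataFrame({"col": [1, 2]})
dfB = pd.DataFrame({
    "col1": [1, None, 1],
    "col2": [None, 2, 2],
    "col3": [1, None, None],
})

# Cross join: every row of dfA paired with every row of dfB (pandas >= 1.2).
pairs = dfA.merge(dfB, how="cross")

# Keep the pairs where 'col' equals at least one of the right-hand columns.
# NaN never compares equal, so missing values do not produce matches.
matches = pairs[dfB.columns].eq(pairs["col"], axis=0).any(axis=1)
result = pairs[matches].reset_index(drop=True)
print(result)
```

The memory cost is the same as before, since the full cross product is still built before filtering.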

Pandas: merge if the left column matches any of the right columns
