Question

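I have two MultiIndex DataFrames, one with two levels and one with three. The first two levels match in both DataFrames. I want to find all rows in the first DataFrame (the three-level one) whose first two index levels match an entry in the second DataFrame. The second DataFrame has no third level.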

The closest answer I found is: How to slice one MultiIndex DataFrame with the MultiIndex of another - but the setup there is slightly different and does not seem to translate to this case.

Consider the setup below

import numpy as np
import pandas as pd

array_1 = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']),
           np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'])]

array_2 = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
      np.array(['one', 'two', 'three', 'one', 'two', 'two', 'one', 'two'])]

df_1 = pd.DataFrame(np.random.randn(8,4), index=array_1).sort_index()

print(df_1)
                  0         1         2         3
bar one a  1.092651 -0.325324  1.200960 -0.790002
    two a -0.415263  1.006325 -0.077898  0.642134
baz one a -0.343707  0.474817  0.396702 -0.379066
    two a  0.315192 -1.548431 -0.214253 -1.790330
foo one b  1.022050 -2.791862  0.172165  0.924701
    two b  0.622062 -0.193056 -0.145019  0.763185
qux one b -1.241954 -1.270390  0.147623 -0.301092
    two b  0.778022  1.450522  0.683487 -0.950528

df_2 = pd.DataFrame(np.random.randn(8,4), index=array_2).sort_index()

print(df_2)

                  0         1         2         3
bar one   -0.354889 -1.283470 -0.977933 -0.601868
    two   -0.849186 -2.455453  0.790439  1.134282
baz one   -0.143299  2.372440 -0.161744  0.919658
    three -1.008426 -0.116167 -0.268608  0.840669
foo two   -0.644028  0.447836 -0.576127 -0.891606
    two   -0.163497 -1.255801 -1.066442  0.624713
qux one   -1.545989 -0.422028 -0.489222 -0.357954
    two   -1.202655  0.736047 -1.084002  0.732150

Now I query the second DataFrame, which returns a subset of the original index

df_2_selection = df_2[(df_2 > 1).any(axis=1)]
print(df_2_selection)

                0         1         2         3
bar two -0.849186 -2.455453  0.790439  1.134282
baz one -0.143299  2.372440 -0.161744  0.919658

I want to find all rows in df_1 whose index matches the index found in df_2_selection. The first two levels line up, but the third does not.

When the indexes line up this is easy, and can be solved with something like:

df_1.loc[df_2_selection.index]  # this works if the indexes are the same

I can also find the rows that match on a single level:

df_1[df_1.index.isin(df_2_selection.index.get_level_values(0), level=0)]

but that does not solve the problem.

Chaining these statements together does not give the desired behavior

df_1[(df_1.index.isin(df_2_selection.index.get_level_values(0),level = 0)) & (df_1.index.isin(df_2_selection.index.get_level_values(1),level = 1))]
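To see why the chained mask above over-matches, here is a minimal sketch on a small deterministic stand-in for the data (the names idx_1, sel_index, mask_0, mask_1 and over_matched are made up for illustration and are not from the question). Each isin call tests one level on its own, so the conjunction keeps every combination of an allowed level-0 value with an allowed level-1 value, not just the selected pairs.

```python
import numpy as np
import pandas as pd

# Small deterministic stand-in for df_1 (three index levels).
idx_1 = pd.MultiIndex.from_tuples([
    ("bar", "one", "a"), ("bar", "two", "a"),
    ("baz", "one", "a"), ("baz", "two", "a"),
])
df_1 = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx_1)

# Stand-in for the two-level index of df_2_selection.
sel_index = pd.MultiIndex.from_tuples([("bar", "two"), ("baz", "one")])

# Each mask looks at a single level, independently of the other.
mask_0 = df_1.index.isin(sel_index.get_level_values(0), level=0)
mask_1 = df_1.index.isin(sel_index.get_level_values(1), level=1)

# The conjunction accepts the cross product {bar, baz} x {one, two},
# so (bar, one) and (baz, two) slip through as well.
over_matched = df_1[mask_0 & mask_1]
print(over_matched.index.tolist())
```

In this toy case all four rows are kept, although only two (level 0, level 1) pairs were actually selected.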

What I have in mind is something like:

df_1_select = df_1[df_1.index.isin(
    df_2_selection.index.get_level_values([0, 1]), level=[0, 1])]  # Doesn't work

print(df_1_select)

                  0         1         2         3
bar two a -0.415263  1.006325 -0.077898  0.642134
baz one a -0.343707  0.474817  0.396702 -0.379066

I have tried many other approaches, none of which does quite what I want. Thanks for your consideration.

Edit:

This also does not work:

df_1.loc[pd_idx[df_2_selection.index.get_level_values(0),df_2_selection.index.get_level_values(1),:],:]

I only want rows where both levels match, not rows where either level matches.

Edit 2: This solution was posted by someone who has since deleted it

id=[x+([x for x in df_1.index.levels[-1]]) for x in df_2_selection.index.values]

pd.concat([df_1.loc[x] for x in id])

It does work! However, it is very slow on large DataFrames. Any help with new approaches or speedups is much appreciated.

Answer 1

You can use reset_index() and merge().

With df_2_selection as:

                0         1         2         3
foo two -0.530151  0.932007 -1.255259  2.441294
qux one  2.006270  1.087412 -0.840916 -1.225508

Merge:

lvls = ["level_0","level_1"]

(df_1.reset_index()
 .merge(df_2_selection.reset_index()[lvls], on=lvls)
 .set_index(["level_0","level_1","level_2"])
 .rename_axis([None]*3)
)

Output:

                  0         1         2         3
foo two b -0.112696  0.287421 -0.380692 -0.035471
qux one b  0.658227  0.632667 -0.193224  1.073132

Note: the rename_axis() part only removes the level names, e.g. level_0. It is purely cosmetic and is not needed for the actual matching.
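For anyone who wants to run the merge approach end to end, here is a self-contained sketch on small deterministic data instead of np.random.randn. The frames are toy stand-ins for the ones in the question, and idx_1, idx_sel and result are illustrative names only. It relies on reset_index() naming unnamed index levels level_0, level_1 and level_2.

```python
import numpy as np
import pandas as pd

# Toy three-level frame standing in for df_1.
idx_1 = pd.MultiIndex.from_tuples([
    ("bar", "one", "a"), ("bar", "two", "a"),
    ("baz", "one", "a"), ("baz", "two", "a"),
])
df_1 = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx_1)

# Toy two-level frame standing in for df_2_selection.
idx_sel = pd.MultiIndex.from_tuples([("bar", "two"), ("baz", "one")])
df_2_selection = pd.DataFrame(np.zeros((2, 2)), index=idx_sel)

lvls = ["level_0", "level_1"]
result = (
    df_1.reset_index()
    # Keep only the key columns on the right side, so the inner merge
    # acts as a filter on (level_0, level_1) pairs.
    .merge(df_2_selection.reset_index()[lvls], on=lvls)
    .set_index(["level_0", "level_1", "level_2"])
    .rename_axis([None] * 3)  # cosmetic: drop the level names again
)
print(result)
```

One caveat, stated as an assumption about your data: if df_2_selection can contain the same (level 0, level 1) pair more than once, as df_2 does for (foo, two), the merge would repeat the matching df_1 rows. Calling drop_duplicates() on the key columns before merging should avoid that.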

Answer 2

Try this:

pd.concat([
    df_1.xs(key, drop_level=False)
    for key in df_2_selection.index.values])
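Below is a runnable sketch of the xs() loop on toy data, together with a possible loop-free variant that is not part of this answer: drop the third level from df_1's index and test membership of the remaining two-level index in the selection's index. The names idx_1, idx_sel, looped and masked are illustrative. I have not benchmarked either version, so treat any speed difference as something to measure on your own data.

```python
import numpy as np
import pandas as pd

# Toy three-level frame standing in for df_1 (index is sorted).
idx_1 = pd.MultiIndex.from_tuples([
    ("bar", "one", "a"), ("bar", "two", "a"),
    ("baz", "one", "a"), ("baz", "two", "a"),
])
df_1 = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx_1)

# Toy two-level frame standing in for df_2_selection.
idx_sel = pd.MultiIndex.from_tuples([("bar", "two"), ("baz", "one")])
df_2_selection = pd.DataFrame(np.zeros((2, 2)), index=idx_sel)

# Approach from the answer: one cross-section per selected key,
# keeping all index levels, then concatenate the pieces.
looped = pd.concat([
    df_1.xs(key, drop_level=False)
    for key in df_2_selection.index.values])

# Loop-free variant (an assumption, not from the answer): compare the
# first two levels of df_1's index against the selection's index.
mask = df_1.index.droplevel(2).isin(df_2_selection.index)
masked = df_1[mask]

print(looped)
print(masked)
```

Note that xs() raises a KeyError for a key that is absent from df_1, whereas the mask simply selects nothing for it, so the two behave differently when the selection contains pairs that df_1 lacks.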

Pandas: slicing one MultiIndex DataFrame with another when some levels don't match
