Question

I want to filter a multi-index DataFrame based on another DataFrame whose index has fewer levels, as in the following example:

import io
import pandas as pd

df1 = io.StringIO('''\
ID1        ID2       ID3    Value   
1          1001      1        1
1          1001      2        2
1          1002      1        9
2          1001      1        3
2          1002      2        4
''')

df2 = io.StringIO('''\
ID1        ID2      Value   
1          1001    2
2          1002    3
''')

expected_result = io.StringIO('''\
ID1        ID2       ID3    Value   
1          1001      1        1
1          1001      2        2
2          1002      2        4
''')

df1 = pd.read_table(df1, sep=r'\s+').set_index(['ID1', 'ID2', 'ID3'])
df2 = pd.read_table(df2, sep=r'\s+').set_index(['ID1', 'ID2'])
expected_result = (pd.read_table(expected_result, sep=r'\s+')
                   .set_index(['ID1', 'ID2', 'ID3']))

assert all(df1.loc[df2.index] == expected_result) # won't work

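If both DataFrames had indexes of the same dimension, one could simply do:

df1.loc[df2.index]

which is equivalent to passing a list of index tuples of the same dimension, e.g.

df1.loc[[(1, 1001, 1), (1, 1001, 2)]]

It is also possible to select a single element based on a lower-dimensional index, like this:

df1.loc[(1, 1001)]

But how can I filter based on a list (or another index) of lower-dimensional entries?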

Answer 1

Getting the desired result seems to be a bit tricky. As of pandas 0.19.2, the multi-index label locator loc appears to be buggy when given an iterable of exactly specified rows:

# this should give the correct result
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))
# messes up with varying levels
print(df1.loc[desired_rows, :])
              Value
ID1 ID2  ID3       
1   1001 2        2
# when reducing the index to the first two same levels, it works
print(df1.loc[desired_rows[:2], :])
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2

Therefore, we cannot rely on loc for your example. In contrast, the multi-index positional locator iloc still works as expected. However, it requires you to obtain the corresponding index positions first, like this:

df2_indices = set(df2.index)  # Index.get_values() was removed in pandas 1.0
df2_levels = df2.index.nlevels
indices = [idx for idx, index in enumerate(df1.index)
           if index[:df2_levels] in df2_indices]
print(df1.iloc[indices, :])
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
2   1002 2        4

Update 15.07.2017

A simpler solution is to convert the desired_rows tuple to a list, since loc behaves more consistently with a list as the row locator:

df1.loc[list(desired_rows), :]
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
2   1002 2        4
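
For reference, the list-based lookup can be sketched end to end. This is a minimal sketch, not the answerer's code: it assumes the sample data from the question (rebuilt here with pd.DataFrame instead of read_table) and derives the full-depth row tuples from df1.index rather than typing them by hand. The names wanted and depth are made up for illustration.

```python
import pandas as pd

# Sample data from the question, rebuilt directly as DataFrames (assumption).
df1 = pd.DataFrame(
    {"ID1": [1, 1, 1, 2, 2],
     "ID2": [1001, 1001, 1002, 1001, 1002],
     "ID3": [1, 2, 1, 1, 2],
     "Value": [1, 2, 9, 3, 4]}
).set_index(["ID1", "ID2", "ID3"])
df2 = pd.DataFrame(
    {"ID1": [1, 2], "ID2": [1001, 1002], "Value": [2, 3]}
).set_index(["ID1", "ID2"])

# (ID1, ID2) prefixes to keep, taken from the lower-dimensional index.
wanted = set(df2.index)
depth = df2.index.nlevels

# Full-depth index tuples of df1 whose leading levels match a wanted prefix.
desired_rows = [idx for idx in df1.index if idx[:depth] in wanted]

# A list (not a tuple) of tuples is treated as a list of row labels.
result = df1.loc[desired_rows, :]
print(result)
```

Building the prefix set once keeps the membership test cheap for each row of df1.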

Answer 2

You can do this by passing the individual index level values, and then passing slice(None) for the third index level, which df2 does not have:

In [107]:
df1.loc[df2.index.get_level_values(0), df2.index.get_level_values(1),slice(None)]

Out[107]:
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
2   1001 1        3
         2        4

We can then see that all the values match:

In [111]:
all(df1.loc[df2.index.get_level_values(0), df2.index.get_level_values(1),slice(None)] == expected_result)

Out[111]:
True

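The issue is that, because the indexes do not have the same dimension, you need to specify what to pass for the missing third level. Passing slice(None) here selects all rows for that level, so the mask will work.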
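
One point that may be worth checking with this approach is whether the levels are matched as pairs or independently. The sketch below assumes the sample data from the question (rebuilt with pd.DataFrame) and writes the call in an explicit form, with the row indexers wrapped in one tuple plus a column slicer; that form is an assumption and may not behave identically to the three-argument call above in every pandas version. In this explicit form each level is filtered on its own, so combinations of ID1 and ID2 that are absent from df2.index can still be selected.

```python
import pandas as pd

# Sample data from the question, rebuilt directly as DataFrames (assumption).
df1 = pd.DataFrame(
    {"ID1": [1, 1, 1, 2, 2],
     "ID2": [1001, 1001, 1002, 1001, 1002],
     "ID3": [1, 2, 1, 1, 2],
     "Value": [1, 2, 9, 3, 4]}
).set_index(["ID1", "ID2", "ID3"])
df2 = pd.DataFrame(
    {"ID1": [1, 2], "ID2": [1001, 1002], "Value": [2, 3]}
).set_index(["ID1", "ID2"])

ids1 = list(df2.index.get_level_values(0))
ids2 = list(df2.index.get_level_values(1))

# Explicit form: all row indexers in one tuple, then a column slicer.
per_level = df1.loc[(ids1, ids2, slice(None)), :]

# (ID1, ID2) pairs that were selected although they are not in df2.index.
selected_pairs = per_level.index.droplevel("ID3")
extra_pairs = sorted(set(selected_pairs) - set(df2.index))
print(per_level)
print(extra_pairs)
```

If extra_pairs is not empty for your data, a pairwise check such as isin on the reduced index is the safer choice.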

Answer 3

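One approach is to temporarily reduce the dimension of the higher-dimensional index, and then filter with indexes of the same dimension: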

 result = (df1.reset_index().set_index(['ID1', 'ID2']).loc[df2.index]
           .reset_index().set_index(['ID1', 'ID2', 'ID3']))
 assert all(result == expected_result) # will pass

But it is rather convoluted.
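
A somewhat shorter variant of the same round trip is sketched here. It assumes the sample data from the question (rebuilt with pd.DataFrame) and moves only the extra level ID3 out of the index and back again, using reset_index with a level argument and set_index with append=True.

```python
import pandas as pd

# Sample data from the question, rebuilt directly as DataFrames (assumption).
df1 = pd.DataFrame(
    {"ID1": [1, 1, 1, 2, 2],
     "ID2": [1001, 1001, 1002, 1001, 1002],
     "ID3": [1, 2, 1, 1, 2],
     "Value": [1, 2, 9, 3, 4]}
).set_index(["ID1", "ID2", "ID3"])
df2 = pd.DataFrame(
    {"ID1": [1, 2], "ID2": [1001, 1002], "Value": [2, 3]}
).set_index(["ID1", "ID2"])

result = (
    df1.reset_index("ID3")              # index is now (ID1, ID2); ID3 is a column
       .loc[df2.index]                  # label lookup with matching dimensions
       .set_index("ID3", append=True)   # put ID3 back as the innermost level
)
print(result)
```

The rows come back in the order of df2.index, which matters if that index is not sorted the same way as df1.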

Answer 4

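The comparison against a list can be done with isin: first drop the extra levels from the index of the higher-dimensional DataFrame, then compare what remains with the index of the lower-dimensional one. In this case: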

 mask = df1.index.droplevel(2).isin(df2.index)
 assert all(df1[mask] == expected_result) # passes
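
The same idea can be sketched as a self-contained script. It assumes the sample data from the question (rebuilt with pd.DataFrame), and the helper name filter_by_lower_index is hypothetical. Instead of hard-coding level 2, it drops every level of the larger index that the smaller one lacks, by name. The result is compared with DataFrame.equals, because all(frame == other) iterates over column labels rather than cell values.

```python
import pandas as pd

# Sample data from the question, rebuilt directly as DataFrames (assumption).
df1 = pd.DataFrame(
    {"ID1": [1, 1, 1, 2, 2],
     "ID2": [1001, 1001, 1002, 1001, 1002],
     "ID3": [1, 2, 1, 1, 2],
     "Value": [1, 2, 9, 3, 4]}
).set_index(["ID1", "ID2", "ID3"])
df2 = pd.DataFrame(
    {"ID1": [1, 2], "ID2": [1001, 1002], "Value": [2, 3]}
).set_index(["ID1", "ID2"])
expected_result = pd.DataFrame(
    {"ID1": [1, 1, 2], "ID2": [1001, 1001, 1002],
     "ID3": [1, 2, 2], "Value": [1, 2, 4]}
).set_index(["ID1", "ID2", "ID3"])


def filter_by_lower_index(high, low):
    """Keep rows of `high` whose index, cut down to the levels of `low`,
    appears in `low.index`.

    Hypothetical helper: assumes all levels are named and that the level
    names of `low` are a subset of those of `high`, in the same order.
    """
    extra = [name for name in high.index.names if name not in low.index.names]
    mask = high.index.droplevel(extra).isin(low.index)
    return high[mask]


result = filter_by_lower_index(df1, df2)
print(result)
print(result.equals(expected_result))
```

Because the mask is a plain boolean array aligned with df1, the original row order and any duplicate rows in df1 are preserved.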

Filtering a multi-index DataFrame based on lower-dimensional values
