Question

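I want to compare (i.e. > or <) the values in one column of a dataframe that has a single, non-unique index (df1) with the values of another dataframe (df2) that has a unique MultiIndex. The comparison should match the values in df2 to each value in df1 through the corresponding first-level index value.

E.g. if a value in df1 is less than any of the values in df2 whose first-level index matches the index of that df1 value, the result should be True. Sample code should make this clear: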

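import numpy as np
import pandas as pd

index_a = [1,2,2,3,3,3]
index_b = [0,0,1,0,1,2]
index = pd.MultiIndex.from_arrays([index_a,index_b], names=('a','b'))
# no seed is set, so the random values differ between runs
df1 = pd.DataFrame(np.random.rand(4,), index = [1,2,3,3], columns=['p'])
df1.index.name = 'a'  # matches the index name shown in the output below
>>> df1
          p
a          
1  0.672379
2  0.130578
3  0.128918
3  0.346115

df2 = pd.DataFrame(np.random.rand(6,), index=index, columns=['p'])
>>> df2
            p
a b          
1 0  0.187448
2 0  0.596792
  1  0.075301
3 0  0.784842
  1  0.256178
  2  0.691007

What I want is the following:

dfexp = df2.unstack('b')
>>> dfexp
          p                    
b         0         1         2
a                              
1  0.187448       NaN       NaN
2  0.596792  0.075301       NaN
3  0.784842  0.256178  0.691007
>>> comp = dfexp.ge(df1.p,axis=0)
>>> comp
       p              
b      0      1      2
a                     
1  False  False  False
2   True  False  False
3   True   True   True
3   True  False   True
>>> comp.any(axis=1)
a
1    False
2     True
3     True
3     True
dtype: bool

Can this be achieved without unstacking df2? The problem is that for some first-level index values the number of second-level labels can be very large, which makes the operation very slow and the dfexp dataframe unnecessarily large. At the same time, the index of df1 can also be very large, so I would like to avoid solving the task by looping over that index, unless that loop can be made very, very fast.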

Answer 1

Setup

import numpy as np
import pandas as pd

index_a = [1,2,2,3,3,3]
index_b = [0,0,1,0,1,2]
index = pd.MultiIndex.from_arrays([index_a,index_b], names=('a','b'))
np.random.seed(3)  # fixed seed so the output below is reproducible
df1 = pd.DataFrame(np.random.rand(4,), index = [1,2,3,3], columns=['p'])
df2 = pd.DataFrame(np.random.rand(6,), index=index, columns=['p'])

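Try:

# For each row of df1, select the df2 values under the same first-level label
# and test whether any of them exceeds the df1 value.
# Series.get_value was removed in pandas 1.0, so plain indexing x['p'] is used instead.
df1.apply(lambda x: (df2.loc[x.name, 'p'] > x['p']).any(), axis=1)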

1     True
2     True
3     True
3    False
dtype: bool
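
The apply call above still visits df1 row by row, which the question wanted to avoid for a large df1. A possible alternative, offered here as a sketch and not part of the original answer, rests on one observation: a df1 value is smaller than some df2 value in its group exactly when it is smaller than that group's maximum. The names group_max, aligned_max and result are illustrative, and the sketch assumes every label in df1's index also appears in the first level of df2 (a missing label would give NaN and therefore False).

```python
import numpy as np
import pandas as pd

# Same setup as in the answer above
index_a = [1, 2, 2, 3, 3, 3]
index_b = [0, 0, 1, 0, 1, 2]
index = pd.MultiIndex.from_arrays([index_a, index_b], names=('a', 'b'))
np.random.seed(3)
df1 = pd.DataFrame(np.random.rand(4), index=[1, 2, 3, 3], columns=['p'])
df2 = pd.DataFrame(np.random.rand(6), index=index, columns=['p'])

# Largest df2 value for each first-level label 'a' (one row per label, no unstacking)
group_max = df2.groupby(level='a')['p'].max()

# Look up the group maximum for every df1 row; repeated labels in df1 are fine
# because group_max itself has a unique index
aligned_max = group_max.reindex(df1.index)

# Compare against the raw values so the result keeps df1's (non-unique) index
result = df1['p'] < aligned_max.to_numpy()
print(result)
```

For the opposite test (a df1 value greater than any df2 value in its group), the same idea would use .min() in place of .max() and flip the comparison.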
