Question

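This may sound like a silly question, but what are the semantics of pandas Int64Index.intersection? I assumed it would compute the set intersection of the two indexes, whether or not they contain duplicates; but see the examples below. I am using pandas 0.14.1.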

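Update: this looks like a bug in pandas. Note that in 0.13 the same call raises an exception saying that intersecting indexes with duplicate values is not allowed.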

Update: this bug appears to have been fixed in pandas 0.15. The 0.15 release notes also mention various issues with duplicate indexes.

Example:

>>> pd.Int64Index([8,4,8]).intersection(pd.Int64Index([5]))
Int64Index([8], dtype='int64')
>>> pd.Int64Index([3,4,8]).intersection(pd.Int64Index([5]))
Int64Index([], dtype='int64')

Why do I get 8 when I intersect 8, 4, 8 with 5? It has something to do with the duplicate values:

>>> pd.Int64Index([1,1,2]).intersection(pd.Int64Index([3]))
Int64Index([], dtype='int64')
>>> pd.Int64Index([2,1,1]).intersection(pd.Int64Index([3]))
Int64Index([1], dtype='int64')
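
For comparison, here is a minimal sketch of the semantics I expected, written with plain Python sets rather than any pandas API. The helper name expected_intersection is hypothetical and only illustrates the set-style behaviour: duplicates are collapsed and only labels present on both sides survive.

```python
def expected_intersection(left, right):
    """Set-style intersection: duplicates collapse, result is sorted."""
    return sorted(set(left) & set(right))


# The inputs from the examples above share no labels, so nothing survives.
print(expected_intersection([8, 4, 8], [5]))     # []
print(expected_intersection([2, 1, 1], [3]))     # []

# A label present on both sides appears once, despite the duplicates.
print(expected_intersection([8, 4, 8], [8, 5]))  # [8]
```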

Update

I am looking at env/lib/python2.7/site-packages/pandas/core/index.py(1318)get_indexer_non_unique(). Execution first reaches line 1086, because the index contains duplicates.

   1081         try:
   1082             indexer = self.get_indexer(other.values)
   1083             indexer = indexer.take((indexer != -1).nonzero()[0])
   1084         except:
   1085             # duplicates
-> 1086             indexer = self.get_indexer_non_unique(other.values)[0].unique()
   1087 
   1088         taken = self.take(indexer)
   1089         if self.name != other.name:
   1090             taken.name = None

At this point:

ipdb> p self
Int64Index([2, 2, 1], dtype='int64')
ipdb> p other
Int64Index([3], dtype='int64')

The function then returns at line 1319:

1  1318         indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
-> 1319         return Index(indexer), missing

This is what it returns:

ipdb> p indexer
array([-1])
ipdb> p missing
array([0])

After that, the indexer is applied with take (line 1088) and picks out the last element of the array. Why? I have no idea.

In fact, the same file says:

def get_indexer_non_unique(self, target, **kwargs):
    """ return an indexer suitable for taking from a non unique index
        return the labels in the same order as the target, and
        return a missing indexer into the target (**missing are marked as -1
        in the indexer**); target must be an iterable """

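So -1 marks a missing label, yet it is used as a position and picks out the last element. Is this a bug in pandas? It looks like a glaring one, and I rely on intersection all over my code. Is the last element always included?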
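
To make the suspected mechanism concrete, here is a rough pure-Python sketch. It is not the pandas implementation: all four function names are hypothetical stand-ins, and it models only the duplicate-handling fallback shown at line 1086, not any other code path pandas may take. It assumes the indexer holds positions for found labels and -1 for missing ones, as the docstring says.

```python
def get_indexer_non_unique(index_values, target_values):
    # Hypothetical stand-in, not the pandas implementation: for each target
    # label, list every position where it occurs, or -1 if it is missing.
    indexer = []
    for label in target_values:
        positions = [i for i, v in enumerate(index_values) if v == label]
        indexer.extend(positions or [-1])
    return indexer


def unique_in_order(items):
    # Drop repeated positions while keeping first-seen order.
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]


def take_unfiltered(index_values, target_values):
    # Mirrors the fallback at line 1086: -1 goes straight into the
    # positional lookup, where it means "last element".
    indexer = unique_in_order(get_indexer_non_unique(index_values, target_values))
    return [index_values[i] for i in indexer]


def take_filtered(index_values, target_values):
    # Same lookup, but the -1 "missing" markers are dropped first.
    indexer = unique_in_order(get_indexer_non_unique(index_values, target_values))
    return [index_values[i] for i in indexer if i != -1]


print(take_unfiltered([2, 2, 1], [3]))  # [1]
print(take_unfiltered([8, 4, 8], [5]))  # [8]
print(take_filtered([2, 2, 1], [3]))    # []
```

If this reading is right, the stray element is simply whatever sits in the last position, which would match the 8 and the 1 in the duplicate examples above.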

Pandas (bug in 0.14?) Int64Index: "intersection" of indexes with duplicates

0 answers: