Question

Sorry, this is not a great title. Here is a simple example:

(pandas version 0.16.1)

import pandas as pd
df = pd.DataFrame({'x': range(1, 5), 'y': [1, 1, 1, 9]})

This works fine:

df.apply( lambda x: x > x.mean() )

       x      y
0  False  False
1  False  False
2   True  False
3   True   True

Shouldn't this work the same way?

df.apply( lambda x: x.mean() < x )
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-467-6f32d50055ea> in <module>()
----> 1 df.apply( lambda x: x.mean() < x )

C:\Users\ei\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   3707                     if reduce is None:
   3708                         reduce = True
-> 3709                     return self._apply_standard(f, axis, reduce=reduce)
   3710             else:
   3711                 return self._apply_broadcast(f, axis)

C:\Users\ei\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
   3797             try:
   3798                 for i, v in enumerate(series_gen):
-> 3799                     results[i] = func(v)
   3800                     keys.append(v.name)
   3801             except Exception as e:

<ipython-input-467-6f32d50055ea> in <lambda>(x)
----> 1 df.apply( lambda x: x.mean() < x )

C:\Users\ei\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\ops.pyc in wrapper(self, other, axis)
    586             return NotImplemented
    587         elif isinstance(other, (np.ndarray, pd.Index)):
--> 588             if len(self) != len(other):
    589                 raise ValueError('Lengths must match to compare')
    590             return self._constructor(na_op(self.values, np.asarray(other)),

TypeError: ('len() of unsized object', u'occurred at index x')

As a counterexample, both of these work:

df.mean() < df

df > df.mean()

Answer 1

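Edit

I finally found the bug: Issue 9369.

As stated in the issue:

left = 0 > s works (e.g. a Python scalar). So I think this is being treated as a 0-dim array (it is a np.int64) (and not as a scalar when called). I'll mark it as a bug. Feel free to dig in.

The problem occurs when a value with a numpy data type (such as np.int64 or np.float64) is on the left side of a comparison operator. A simple workaround, which @santon mentions in his answer, is to convert the number to a Python scalar instead of using the numpy scalar.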
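To make the "0-dim array" remark concrete, here is a minimal sketch that uses only NumPy. It does not reproduce the pandas code path from the traceback; it only shows, as an illustration, that a NumPy scalar turned into an array has no length (the same error message as above), and that `float()` or `.item()` gives back a plain Python scalar. The variable names are made up for this example.

```python
import numpy as np

# Answer 2 notes that mean() returns a numpy.float64; build one directly.
m = np.float64(2.5)

# Viewed as an array, a NumPy scalar is 0-dimensional.
arr = np.asarray(m)
print(arr.ndim)  # 0

# A 0-dimensional array has no length, so len() raises TypeError.
try:
    len(arr)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)  # len() of unsized object

# Either conversion yields a built-in Python float instead of a NumPy scalar.
as_float = float(m)
as_item = m.item()
print(type(as_float), type(as_item))  # <class 'float'> <class 'float'>
```

With a built-in float on the left, the comparison no longer involves a NumPy type on that side, which is what the workaround relies on.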

Old:

I tried this on Pandas 0.16.2.

I did the following on the original df:

In [22]: df['z'] = df['x'].mean() < df['x']

In [23]: df
Out[23]:
   x  y      z
0  1  1  False
1  2  1  False
2  3  1   True
3  4  9   True

In [27]: df['z'].mean() < df['z']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-afc8a7b869b4> in <module>()
----> 1 df['z'].mean() < df['z']

C:\Anaconda3\lib\site-packages\pandas\core\ops.py in wrapper(self, other, axis)
    586             return NotImplemented
    587         elif isinstance(other, (np.ndarray, pd.Index)):
--> 588             if len(self) != len(other):
    589                 raise ValueError('Lengths must match to compare')
    590             return self._constructor(na_op(self.values, np.asarray(other)),

TypeError: len() of unsized object

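This looks like a bug to me. I can compare booleans with ints and the other way around, but the problem appears when the mean of a boolean column is compared with the boolean column itself (although I don't think taking mean() of booleans makes much sense):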

In [24]: df['z'] < df['x']
Out[24]:
0    True
1    True
2    True
3    True
dtype: bool

In [25]: df['z'] < df['x'].mean()
Out[25]:
0    True
1    True
2    True
3    True
Name: z, dtype: bool

In [26]: df['x'].mean() < df['z']
Out[26]:
0    False
1    False
2    False
3    False
Name: z, dtype: bool

I also tried Pandas 0.16.1 and reproduced the problem there. It can be reproduced with:
In [10]: df['x'].mean() < df['x']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-4e5dab1545af> in <module>()
----> 1 df['x'].mean() < df['x']

/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/ops.pyc in wrapper(self, other, axis)
    586             return NotImplemented
    587         elif isinstance(other, (np.ndarray, pd.Index)):
--> 588             if len(self) != len(other):
    589                 raise ValueError('Lengths must match to compare')
    590             return self._constructor(na_op(self.values, np.asarray(other)),

TypeError: len() of unsized object

In [11]: df['x'] < df['x'].mean()
Out[11]:
0     True
1     True
2    False
3    False
Name: x, dtype: bool

So this appears to be a bug that was fixed in Pandas 0.16.2 (except when mixing booleans with integers). I suggest upgrading your pandas version with:
pip install pandas --upgrade

Answer 2

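I think this has to do with how the greater-than operator is overloaded. With overloaded operators, the order matters when the left-hand and right-hand operands have different types. (Python has a fairly involved procedure for deciding which overloaded method to use.) You can make the code work by converting the result of mean() (a numpy.float64) to a plain float: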

df.apply( lambda x: float(x.mean()) < x )

For some reason, the pandas code seems to treat numpy.float64 as an array, which is probably why it fails.
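The point about operand order can be sketched in pure Python. In the toy example below, `Column` and `EagerScalar` are hypothetical classes invented for illustration; they are not pandas or NumPy code. It shows that for `a < b` Python asks the left operand first, and only falls back to the right operand's reflected method (`__gt__`) when the left one returns `NotImplemented`, as a built-in float does for an unknown type.

```python
calls = []  # records which comparison method actually ran


class Column:
    """Hypothetical stand-in for a column of numbers."""

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, other):
        calls.append("Column.__gt__")
        return [v > float(other) for v in self.values]

    def __lt__(self, other):
        calls.append("Column.__lt__")
        return [v < float(other) for v in self.values]


class EagerScalar(float):
    """Hypothetical scalar that handles '<' itself instead of deferring."""

    def __lt__(self, other):
        calls.append("EagerScalar.__lt__")
        if isinstance(other, Column):
            return [float(self) < v for v in other.values]
        return float.__lt__(self, other)


col = Column([1, 2, 3, 4])

r1 = col > 2.5               # Column is on the left, so Column.__gt__ runs
r2 = 2.5 < col               # float defers, Python falls back to Column.__gt__
r3 = EagerScalar(2.5) < col  # the left operand answers; Column is never asked

print(r1, r2, r3)
print(calls)
```

All three comparisons give the same element-wise result here, but the third one never reaches `Column`'s own logic. If the left operand's path has a bug, it only shows up in that order, which fits the asymmetry described in the question.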

Why does the comparison order matter for this apply/lambda inequality?
