Question

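I have a pandas.DataFrame with about 100 columns and 200,000 rows. I am trying to convert it to a boolean DataFrame where True means the value is at or above a threshold, False means it is below the threshold, and NaN values are preserved.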

If I ignore the NaN values, this runs in about 60 ms:

df >= threshold

But when I try to handle NaN, the approach below works but is very slow (20 s).

def func(x):
    if x >= threshold:
        return True
    elif x < threshold:
        return False
    else:
        return x
df.apply(lambda x: x.apply(lambda x: func(x)))

Is there a faster way?

Answer 1

You can do this:

new_df = df >= threshold
new_df[df.isnull()] = np.nan

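But this is not the same as what you get from the apply approach. Here your mask has a float dtype containing NaN, 0.0 and 1.0. In the apply solution you get an object dtype containing NaN, False and True.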

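Neither can be used as a mask, because you may not get what you want. IEEE 754 says that any comparison with NaN (other than !=) must yield False, and the apply method implicitly violates that by returning NaN!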

The best option is to keep track of the NaNs separately; df.isnull() is quite fast when the bottleneck package is installed.
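As a rough sketch of keeping the NaNs in a separate mask: the sample data, the threshold value, and the names `is_ge`, `is_missing` and `indicator` below are made up for illustration and are not from the question. The last step builds a float indicator explicitly with `astype(float)` and `where`, so the resulting dtype does not depend on how a particular pandas version upcasts a bool frame on assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data and threshold, for illustration only.
df = pd.DataFrame({"a": [0.5, np.nan, 2.0], "b": [np.nan, 1.0, -1.0]})
threshold = 1.0

# Plain vectorized comparison: stays bool, NaN compares as False.
is_ge = df >= threshold

# Track the missing values in their own boolean mask.
is_missing = df.isnull()

# Optional float indicator: 1.0 / 0.0 where data exists, NaN where it was missing.
indicator = is_ge.astype(float).where(~is_missing)
```

Both `is_ge` and `is_missing` remain usable as ordinary boolean masks, while `indicator` carries the three-state result in a single float64 frame.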

Answer 2

You can check for NaNs separately, as described in this post: Python - find integer index of rows with NaN in pandas

df.isnull()

Then combine the output of isnull with df >= threshold using bitwise or:

df.isnull() | (df >= threshold)

You can expect computing and combining the two masks to take closer to 200 ms, but that should be far enough below 20 s to be acceptable.
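A small sketch of the combination, using made-up data and a made-up threshold. One thing worth watching is operator precedence: in Python `|` binds more tightly than `>=`, so the comparison needs its own parentheses. The names `missing`, `ge_or_missing` and `lt` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data and threshold, for illustration only.
df = pd.DataFrame({"a": [0.5, np.nan, 2.0]})
threshold = 1.0

missing = df.isnull()

# Parentheses are required: without them this parses as
# (df.isnull() | df) >= threshold.
ge_or_missing = missing | (df >= threshold)

# Rows known to be below the threshold: neither missing nor >= threshold.
lt = ~ge_or_missing
```

Note that the combined mask treats a missing value as True, so `missing` still has to be kept around if NaN needs to be told apart from a real value at or above the threshold.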

Answer 3

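In this situation I use an indicator array of floats, coded as: 0 = False, 1 = True, NaN = missing. A pandas DataFrame with bool dtype cannot hold missing values, and a DataFrame with object dtype holding a mix of Python bool and float objects is not efficient. This leads us to a DataFrame with np.float64 dtype. numpy.sign(x - threshold) gives -1 = (x < threshold), 0 = (x == threshold) and +1 = (x > threshold) for your comparison, which may be good enough for your purposes, but if you really need the 0/1 encoding, the conversion can be done in place. The timings below are for an array x of length 200K: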

In [45]: %timeit y = (x > 0); y[pd.isnull(x)] = np.nan
100 loops, best of 3: 8.71 ms per loop

In [46]: %timeit y = np.sign(x)
100 loops, best of 3: 1.82 ms per loop

In [47]: %timeit y = np.sign(x); y += 1; y /= 2
100 loops, best of 3: 3.78 ms per loop
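To make the encodings concrete, here is a minimal sketch on a four-element array; the values and the threshold are invented for the example. One detail to be aware of: after the in-place `+= 1` and `/= 2`, a value exactly equal to the threshold maps to 0.5 rather than to 0 or 1, so if `>=` semantics are needed, the comparison-plus-NaN variant (`y_ge` below) may be the safer choice.

```python
import numpy as np

# Hypothetical threshold and data, for illustration only.
threshold = 1.0
x = np.array([0.5, 1.0, 2.0, np.nan])

# Sign encoding: -1 below, 0 equal, +1 above, NaN stays NaN.
y = np.sign(x - threshold)   # [-1., 0., 1., nan]

# In-place conversion towards a 0/1 encoding.
y += 1
y /= 2                       # [0., 0.5, 1., nan]

# Alternative with >= semantics: compare first, then restore the NaNs.
y_ge = (x >= threshold).astype(np.float64)
y_ge[np.isnan(x)] = np.nan   # [0., 1., 1., nan]
```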

Keeping NaNs with pandas DataFrame inequalities
