Question

我尝试将pandas DataFrame对象转换为包含基于某些简单阈值的点分类的新对象：

如果该点为AMyStaticMeshActor a; // C++98/03 AMyStaticMeshActor a{}; //C++11 and up

0

如果该点为负数或0

NaN

如果值超出基于整个列的某些条件，则值转换为1
值为2，否则

这是一个非常简单的自包含示例：

enter image description here

到目前为止创建的转型过程：

import pandas as pd
import numpy as np

df=pd.DataFrame({'a':[np.nan,1000000,3,4,5,0,-7,9,10],'b':[2,3,-4,5,6,1000000,7,9,np.nan]})

print(df)

导致：

enter image description here

我尝试使用.applymap()和/或.groupby()来加快流程，但没有运气。我在this answer中找到了一些指导，但是，当您未在大熊猫列中进行分组时，我仍然不确定#Loop through and find points greater than the mean -- in this simple example, these are the 'outliers' outliers = pd.DataFrame() for datapoint in df.columns: tempser = pd.DataFrame(df[datapoint][np.abs(df[datapoint]) > (df[datapoint].mean())]) outliers = pd.merge(outliers, tempser, right_index=True, left_index=True, how='outer') outliers[outliers.isnull() == False] = 2 #Classify everything else as "3" df[df > 0] = 3 #Classify negative and zero points as a "1" df[df <= 0] = 1 #Update with the outliers df.update(outliers) #Everything else is a "0" df.fillna(value=0, inplace=True)如何有用。

Answer 1

这是异常值部分的替代品。它在我的计算机上的样本数据快了大约5倍。

>>> pd.DataFrame( np.where( np.abs(df) > df.mean(), 2, df ), columns=df.columns )

    a   b
0 NaN   2
1   2   3
2   3  -4
3   4   5
4   5   6
5   0   2
6  -7   7
7   9   9
8  10 NaN

您也可以使用apply，但它会比np.where方法慢（但速度与您当前正在进行的速度大致相同），但要简单得多。这可能是一个很好的例子，说明为什么在关心速度时应尽可能避免apply。

>>> df[ df.apply( lambda x: abs(x) > x.mean() ) ] = 2

你也可以这样做，这比apply快但比np.where慢：

>>> mask = np.abs(df) > df.mean()
>>> df[mask] = 2

当然，这些东西并不总是线性扩展，因此请根据您的实际数据进行测试，看看它们是如何比较的。

使用.apply，.applymap，.groupby转换Pandas DataFrame中的异常值

1 个答案: