Question

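I have a pandas DataFrame with many columns (> 100). I standardized all the columns so that each one is centered at 0 (mean 0 and standard deviation 1). I want to get rid of every row that has a value below -2 or above 2 in any column. For example, if rows 2, 3, 4 are outliers in the first column and rows 3, 4, 5, 6 are outliers in the second column, then I want to get rid of rows [2,3,4,5,6].

What I am trying to do is use a for loop to go through each column, collect the row indices of the outliers, and store them in a list. In the end I have a list containing one list of row indices per column. I take the unique values to get the row indices I should get rid of. My problem is that I don't know how to slice the DataFrame so that it doesn't contain these rows. I was thinking of using an %in% operator, but it does not accept a list of lists. I show my code below.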

### Getting rid of the outliers
'''
We are going to get rid of the outliers that are outside the range of -2 to 2.
'''
import numpy as np

aux_features = features_scaled.values  # features_scaled is the standardized DataFrame
n_cols = aux_features.shape[1]
n_rows = aux_features.shape[0]
outliers_index = []

for i in range(n_cols):
    variable = aux_features[:,i] # We take one column at a time
    condition = (variable < -2) | (variable > 2) # We establish the condition for the outliers
    index = np.where(condition)
    outliers_index.append(index)

outliers = [j for i in outliers_index for j in i]

outliers_2 = np.array([j for i in outliers for j in i])
unique_index = list(np.unique(outliers_2)) # This is the final list with all the row indices that contain outliers.

total_index = list(range(n_rows))

aux = (total_index in unique_index)

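outliers_2 contains all the row indices (including repeats), and in unique_index I keep only the unique values, so I end up with every row index that has an outlier. I am stuck at this part. Does anyone know how to complete it, or have a better idea of how to get rid of these outliers? (I think my method would be very time consuming for very large datasets.)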
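
One possible way to finish the index-based approach described above, as a minimal sketch: once the positional row indices are known, a boolean mask built with `np.isin` keeps every row that is not in the list. The small `features_scaled` frame here is made-up data chosen to match the example in the question (rows 2, 3, 4 in the first column, rows 3, 4, 5, 6 in the second), not the asker's real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the standardized data from the question.
features_scaled = pd.DataFrame({
    "a": [0.1, -0.3, 2.5, -2.2, 3.0, 0.4, 0.0],
    "b": [0.2, 0.5, 0.1, 2.1, -2.6, 2.4, -2.3],
})

values = features_scaled.values
# np.where on a 2-D condition returns (row_positions, column_positions),
# so no per-column loop is needed.
row_pos, _ = np.where((values < -2) | (values > 2))
unique_index = np.unique(row_pos)  # positions of rows with at least one outlier

# Keep the rows whose position is NOT in unique_index.
keep = ~np.isin(np.arange(len(features_scaled)), unique_index)
cleaned = features_scaled[keep]
```

This works on row positions rather than index labels, so it should behave the same even if the DataFrame's index is not 0..n-1.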

Answer 1

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.standard_normal(size=(1000, 5)))  # example data
cleaned = df[~(np.abs(df) > 2).any(axis=1)]  # keep rows with no |value| > 2

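Explanation:

Filter the DataFrame for values above 2 or below -2. This returns a DataFrame of booleans:

np.abs(df) > 2

Check whether each row contains an outlier. This evaluates to True for every row where at least one outlier exists:

(np.abs(df) > 2).any(axis=1)

Finally, use the ~ operator to select all rows that contain no outliers:

df[~(np.abs(df) > 2).any(axis=1)]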
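
To make the three steps concrete, here is a small worked sketch on a made-up four-row frame (the column names `x` and `y` and the values are illustrative only). Each intermediate result is kept in its own variable so it can be inspected.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 1 has an outlier in x, row 2 has one in y.
df = pd.DataFrame({
    "x": [0.5, -2.5, 1.0, 0.0],
    "y": [1.9, 0.3, 2.2, -0.7],
})

mask = np.abs(df) > 2            # boolean DataFrame, True where |value| > 2
has_outlier = mask.any(axis=1)   # boolean Series, True for rows with any outlier
cleaned = df[~has_outlier]       # only rows 0 and 3 remain
```

Because the comparison and the row-wise reduction are vectorized, this avoids the explicit loop over columns from the question.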

Getting rid of outlier rows in a multi-column pandas DataFrame
