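I asked a question earlier about creating an efficient loop to find outliers, and someone here gave a great answer. Now I am writing this code to find outliers in a numpy array. For this program, the array is x_a and a data point is x. Mu is the mean of x_a without the data point, and sigma is the standard deviation of x_a without the data point. If the data point is an outlier (more than 3 sigma away from mu, as in the code), it is removed from x_a. That is what I am trying to do in the code below. I created the datasets x_a and x_trimmed (which is supposed to be the dataset without the point x, as required). The code and the output are below.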
import numpy as np
x_a = np.array([99.5697438, 94.47019021, 55.0, 106.86672855, 102.78730151,
                131.85777845, 88.25376895, 96.94439838, 83.67782174,
                115.57993209, 118.97651966, 94.40479467, 79.63342207, 77.88602065,
                96.59145004, 99.50145353, 97.25980235, 87.72010069, 101.30597215,
                87.3110369, 110.0687946, 104.71504012, 89.34719772, 160.0,
                110.61519268, 112.94716398, 104.41867586])
outliers = True
while outliers:
    # Define mean and std of the dataset
    x_mu = np.mean(x_a, axis=0)
    x_std = np.std(x_a, axis=0)
    # Define the dataset WITHOUT the data point, and calculate the mean and std WITHOUT the datapoint
    x_trimmed = [x for x in x_a if (x < x_mu + (3 * x_std)) or (x > x_mu - (3 * x_std))]
    trim_mu = np.mean(x_trimmed, axis=0)
    trim_std = np.std(x_trimmed, axis=0)
    for cell in x_a:
        if cell > x_mu + (3 * x_std) or cell < x_mu - (3 * x_std):
            print("Removed the data point " + str(cell))
            index = np.where(x_a == cell)
            x_a = np.delete(x_a, obj=index)
    if np.array_equal(x_a, x_trimmed):
        print("No more outlier detected!")
        outliers = False
However, the output is as follows:
Removed the data point 160.0
No more outlier detected!
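I checked this by hand, removing the data points 55.0 and 131.85777845, and found that the point 131.85777845 is indeed about 3.07 standard deviations from the mean. The expected output is that 160, 55 and 131.85777845 should all be removed.
What changes to the code are needed to get the correct output?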
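To make the hand check concrete, here is a minimal sketch of the kind of "leave-one-out" distance I mean: the mean and standard deviation are taken over every point except the one being tested. The helper name `loo_zscore` and the toy array are made up for illustration and are not part of my program; I am also assuming numpy's default population standard deviation (`ddof=0`).

```python
import numpy as np


def loo_zscore(data, i):
    """Distance of data[i] from the mean of the OTHER points, in units of their std.

    Hypothetical helper for illustration only; relies on np.std's default ddof=0
    and does not guard against the other points having zero spread.
    """
    data = np.asarray(data, dtype=float)
    rest = np.delete(data, i)   # every point except data[i]
    mu = rest.mean()            # mean without the data point
    sigma = rest.std()          # std without the data point
    return (data[i] - mu) / sigma


# Toy data (invented for this sketch): seven values near 10 and one far away.
sample = np.array([10.0, 10.5, 9.5, 10.2, 9.8, 10.1, 9.9, 50.0])
z_far = loo_zscore(sample, 7)   # the 50.0 entry
z_near = loo_zscore(sample, 1)  # the 10.5 entry
print(abs(z_far) > 3, abs(z_near) > 3)
```

As far as I can tell, applying a check like this to each remaining point of x_a after every removal corresponds to what I did by hand, but I have not managed to get my loop to behave that way.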