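I'm a beginner in Python and programming. I'm trying to write a program that loops over a given numpy array and detects anomalies in the dataset (an anomaly is defined as a value more than 3 standard deviations from the mean, with the mean taken without that data point). Each time an anomalous data point is removed, I need to recompute the mean and standard deviation.
I've written the code below, but I've noticed two problems. After one pass through the loop, it reports that the value 160 was removed, yet when I print new_array I still see 160 in the array.
Also, how do I recompute the new mean each time a data point is removed? I suspect something is in the wrong place inside the for loop. Finally, is my use of continue correct, or should it go somewhere else?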
import numpy as np
data_array = np.array([
99.5697438 , 94.47019021, 55., 106.86672855,
102.78730151, 131.85777845, 88.25376895, 96.94439838,
83.67782174, 115.57993209, 118.97651966, 94.40479467,
79.63342207, 77.88602065, 96.59145004, 99.50145353,
97.25980235, 87.72010069, 101.30597215, 87.3110369 ,
110.0687946 , 104.71504012, 89.34719772, 160.,
110.61519268, 112.94716398, 104.41867586])
for cell in data_array:
    mean = np.mean(data_array, axis=0)
    sd = np.std(data_array, axis=0)
    lower_anomaly_point = mean - (3 * sd)
    upper_anomaly_point = mean + (3 * sd)
    if cell > upper_anomaly_point or cell < lower_anomaly_point:
        print(str(cell) + 'has been removed.')
        new_array = np.delete(data_array, cell)
        continue
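Answer 0 (score: 2)
I think you should look at the NumPy documentation for numpy.delete() and read its opening lines: for a one-dimensional array it returns the entries that are not selected by arr[obj]. In other words, numpy.delete() works on indices, not on values.
I suggest editing your code to get the index of the cell and passing that to np.delete().
Here is the edited code: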
import numpy as np
data_array = np.array([99.5697438, 94.47019021, 55.0, 106.86672855, 102.78730151, 131.85777845, 88.25376895, 96.94439838, 83.67782174, 115.57993209, 118.97651966, 94.40479467, 79.63342207, 77.88602065, 96.59145004, 99.50145353, 97.25980235, 87.72010069, 101.30597215, 87.3110369, 110.0687946, 104.71504012, 89.34719772, 160.0, 110.61519268, 112.94716398, 104.41867586])
print(data_array)
for cell in data_array:
    mean = np.mean(data_array, axis=0)
    sd = np.std(data_array, axis=0)
    lower_anomaly_point = mean - (3 * sd)
    upper_anomaly_point = mean + (3 * sd)
    if cell > upper_anomaly_point or cell < lower_anomaly_point:
        print(str(cell) + ' has been removed.')
        # np.delete expects positions, so look up where this value occurs
        index = np.where(data_array == cell)
        new_array = np.delete(data_array, obj=index)
        continue
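To make the index-versus-value point concrete, here is a small sketch. The array `arr` and its values are made up for the illustration and are not the asker's data:

```python
import numpy as np

# Hypothetical small array, used only for this illustration.
arr = np.array([10.0, 20.0, 160.0, 30.0])

# np.delete treats its second argument as a position, not a value:
# this drops whatever sits at index 2.
without_index_2 = np.delete(arr, 2)

# To remove by value, find the matching position(s) first.
# np.where on a 1-D condition returns a tuple; [0] is the index array.
positions = np.where(arr == 160.0)[0]
without_160 = np.delete(arr, positions)

print(without_index_2)
print(without_160)
print(arr)  # the original array is left untouched
```

Note that `np.delete` returns a new array, so the result has to be assigned to a name if you want to keep working with it.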
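Answer 1 (score: 1)
As @damagedcoda said, your main mistake is passing a value where an index is expected. However, if you recompute lower_anomaly_point and upper_anomaly_point inside the loop, you will run into new problems. So I suggest using np.where to solve your task: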
import numpy as np
data_array = np.array([
99.5697438 , 94.47019021, 55., 106.86672855,
102.78730151, 131.85777845, 88.25376895, 96.94439838,
83.67782174, 115.57993209, 118.97651966, 94.40479467,
79.63342207, 77.88602065, 96.59145004, 99.50145353,
97.25980235, 87.72010069, 101.30597215, 87.3110369 ,
110.0687946 , 104.71504012, 89.34719772, 160.,
110.61519268, 112.94716398, 104.41867586])
mean = np.mean(data_array, axis=0)
sd = np.std(data_array, axis=0)
lower_anomaly_point = mean - (3 * sd)
upper_anomaly_point = mean + (3 * sd)
data_array = data_array[
    np.where(
        (upper_anomaly_point > data_array) & (data_array > lower_anomaly_point)
    )]
The result is:
array([ 99.5697438 , 94.47019021, 55. , 106.86672855,
102.78730151, 131.85777845, 88.25376895, 96.94439838,
83.67782174, 115.57993209, 118.97651966, 94.40479467,
79.63342207, 77.88602065, 96.59145004, 99.50145353,
97.25980235, 87.72010069, 101.30597215, 87.3110369 ,
110.0687946 , 104.71504012, 89.34719772, 110.61519268,
112.94716398, 104.41867586])
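The code above filters once. If the mean and standard deviation really need to be recomputed after each removal, as the question asks, one possible sketch is to repeat the filter until nothing more is dropped. The function name `remove_outliers_iteratively` and the parameter `n_sd` are my own, and this version removes every outlier found in a pass at once rather than strictly one point at a time:

```python
import numpy as np

def remove_outliers_iteratively(values, n_sd=3.0):
    """Drop points farther than n_sd standard deviations from the mean,
    recomputing mean and sd after every pass, until no point is dropped."""
    values = np.asarray(values, dtype=float)
    while values.size > 0:
        mean = values.mean()
        sd = values.std()
        # Keep points inside [mean - n_sd*sd, mean + n_sd*sd].
        keep = np.abs(values - mean) <= n_sd * sd
        if keep.all():
            break  # nothing left to remove
        values = values[keep]
    return values

data_array = np.array([
    99.5697438, 94.47019021, 55., 106.86672855,
    102.78730151, 131.85777845, 88.25376895, 96.94439838,
    83.67782174, 115.57993209, 118.97651966, 94.40479467,
    79.63342207, 77.88602065, 96.59145004, 99.50145353,
    97.25980235, 87.72010069, 101.30597215, 87.3110369,
    110.0687946, 104.71504012, 89.34719772, 160.,
    110.61519268, 112.94716398, 104.41867586])

cleaned = remove_outliers_iteratively(data_array)
print(cleaned.size)
```

Building a fresh boolean mask on each pass avoids modifying an array while iterating over it, which is the problem a recomputation inside the for loop would run into.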
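Answer 2 (score: 0)
The code fails for me. data_array never changes: np.delete returns a new array and does not modify the old one. You don't use new_array anywhere in the code; you probably want to compute the mean from new_array. The second argument of delete should be an index, "indicating which sub-arrays to remove". You cannot pass the cell value.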
import numpy as np
data_array = np.array([
99.5697438 , 94.47019021, 55., 106.86672855,
102.78730151, 131.85777845, 88.25376895, 96.94439838,
83.67782174, 115.57993209, 118.97651966, 94.40479467,
79.63342207, 77.88602065, 96.59145004, 99.50145353,
97.25980235, 87.72010069, 101.30597215, 87.3110369 ,
110.0687946 , 104.71504012, 89.34719772, 160.,
110.61519268, 112.94716398, 104.41867586])
mean = np.mean(data_array, axis=0)
sd = np.std(data_array, axis=0)
lower_anomaly_point = mean - (3 * sd)
upper_anomaly_point = mean + (3 * sd)
new_array = data_array.copy()
k = 0
for i, cell in enumerate(data_array):
    if cell > upper_anomaly_point or cell < lower_anomaly_point:
        print(str(cell) + ' has been removed.')
        # i - k shifts the index to account for elements already deleted
        new_array = np.delete(new_array, i - k)
        k += 1
new_array is data_array without 160.
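As a side note, np.delete also accepts an array of indices, so the offset bookkeeping with k can be avoided by collecting all positions first and deleting them in one call. A minimal sketch, where the array `values` and the fixed thresholds `lower` and `upper` are hypothetical and chosen only for the demonstration:

```python
import numpy as np

# Hypothetical data with two obvious outliers, for illustration only.
values = np.array([1.0, 2.0, 100.0, 3.0, -50.0, 2.5])
lower, upper = 0.0, 10.0  # assumed fixed thresholds for this demo

# Collect every offending position first...
bad = np.flatnonzero((values > upper) | (values < lower))

# ...then delete them all in a single call, so no index shifting is needed.
cleaned = np.delete(values, bad)

print(bad)
print(cleaned)
```

Because all indices refer to the same original array, they stay valid no matter how many elements are removed.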