I have this DataFrame:
DF1
with the following columns:
obs_1 obs_2
31 173
16 20
38 49
12 16
45 49
14 174
83 88
43 46
43 46
27 45
32 40
625 669
4 4
61 99
20 26
103 -356
8 110
146 246
38 50
11 92
10 97
9 90
217 234
9 177
28 28
22 22
12 123
35 147
59 63
31 143
18 130
45 55
46 50
21 21
17 152
63 70
52 73
24 24
15 -1172
43 54
88 96
22 34
42 56
14 56
19 20
40 42
23 120
68 73
80 -1263
14 124
35 41
40 176
13 52
21 26
22 102
43 -1325
18 18
36 162
68 69
17 34
20 30
26 27
45 55
78 82
I'm trying to find the outliers and record in a new column whether each value is an outlier, using this function:
import numpy as np

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.

    Parameters:
    -----------
        points : A numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    # Treat a 1-D input as a single-column 2-D array
    if len(points.shape) == 1:
        points = points[:, None]
    # Per-dimension median, then each observation's Euclidean distance from it
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    # Median absolute deviation (MAD) of those distances
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
It is discussed here: Link to discussion
I tried the following code:
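For a one-dimensional input, the function above reduces to a few lines of arithmetic. The following is only a minimal sketch on a made-up six-value sample (the numbers are illustrative and not taken from DF1), showing the median, the median absolute deviation, and the modified z-score step by step:

```python
import numpy as np

# Illustrative sample (an assumption, not data from DF1):
# five ordinary values and one extreme value.
points = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])

median = np.median(points)           # centre of the sample
abs_dev = np.abs(points - median)    # distance of each point from the median
mad = np.median(abs_dev)             # median absolute deviation (MAD)

# Same scaling constant and default threshold as in is_outlier
modified_z = 0.6745 * abs_dev / mad
mask = modified_z > 3.5

print(median, mad)
print(mask)
```

Note that if the MAD comes out as 0 (more than half of the values identical), the division produces inf or nan, so that case would need separate handling.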
DF1['obs_1_outlier'] = is_outlier(df1.obs_1.to_numpy())
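I don't get any errors, but every result is False, and I suspect that something in the function isn't being computed correctly.
I feel it's the way I'm passing the column to the function, but I can't put my finger on it.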
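One way to narrow this down might be to rebuild the column in isolation and print the intermediate values. The sketch below assumes the column holds exactly the obs_1 values listed above, uses a single hypothetical variable name (`df`) for the frame, and converts the column to float explicitly. Under those assumptions the largest value, 625, would be expected to be flagged, so if the real frame still gives all False, the difference probably lies in the data actually passed in (for example `DF1` and `df1` being two different objects, since Python names are case-sensitive, or the column not being numeric).

```python
import numpy as np
import pandas as pd

def is_outlier(points, thresh=3.5):
    # Compact copy of the function from the question (docstring omitted)
    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    diff = np.sqrt(np.sum((points - median) ** 2, axis=-1))
    med_abs_deviation = np.median(diff)
    return 0.6745 * diff / med_abs_deviation > thresh

# obs_1 values copied from the table in the question
obs_1 = [31, 16, 38, 12, 45, 14, 83, 43, 43, 27, 32, 625, 4, 61, 20, 103,
         8, 146, 38, 11, 10, 9, 217, 9, 28, 22, 12, 35, 59, 31, 18, 45,
         46, 21, 17, 63, 52, 24, 15, 43, 88, 22, 42, 14, 19, 40, 23, 68,
         80, 14, 35, 40, 13, 21, 22, 43, 18, 36, 68, 17, 20, 26, 45, 78]

# `df` is a hypothetical name; the point is to use ONE name consistently
df = pd.DataFrame({"obs_1": obs_1})

# Force a numeric dtype so strings or objects cannot slip through silently
values = df["obs_1"].to_numpy(dtype=float)
print("dtype:", values.dtype)
print("median:", np.median(values))

mask = is_outlier(values)
df["obs_1_outlier"] = mask

# Show only the rows that were flagged
flagged = df.loc[df["obs_1_outlier"], "obs_1"].tolist()
print("flagged values:", flagged)
```

Comparing the printed dtype and median with what the real frame produces should show whether the function or the input is at fault.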