Function for finding outliers

Date: 2019-12-26 18:17:40

Tags: python

I have this data frame:

DF1

with the following columns:

obs_1   obs_2
31  173
16  20
38  49
12  16
45  49
14  174
83  88
43  46
43  46
27  45
32  40
625 669
4   4
61  99
20  26
103 -356
8   110
146 246
38  50
11  92
10  97
9   90
217 234
9   177
28  28
22  22
12  123
35  147
59  63
31  143
18  130
45  55
46  50
21  21
17  152
63  70
52  73
24  24
15  -1172
43  54
88  96
22  34
42  56
14  56
19  20
40  42
23  120
68  73
80  -1263
14  124
35  41
40  176
13  52
21  26
22  102
43  -1325
18  18
36  162
68  69
17  34
20  30
26  27
45  55
78  82

I am trying to find outliers and record in a new column whether each value is an outlier, using this function:

import numpy as np

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.

    Parameters:
    -----------
        points : A numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    # Treat a 1-D input as a single column of observations.
    if len(points.shape) == 1:
        points = points[:, None]
    # Euclidean distance of each observation from the per-dimension median.
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    # Median absolute deviation (MAD) of those distances.
    med_abs_deviation = np.median(diff)

    # 0.6745 is approximately the 0.75 quantile of the standard normal
    # distribution, which puts the score on a z-score-like scale.
    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh

It is discussed here: Link to discussion
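
For a single column, the distance that the function computes reduces to the absolute difference from the median, so the modified z-score can be worked through by hand. The following is a minimal sketch of that 1-D case, not the original author's code: the helper name `modified_z_scores` and the sample values are hypothetical, and the zero-MAD check is an added assumption about how one might want to handle that case.

```python
import numpy as np

def modified_z_scores(values):
    """Modified z-scores of a 1-D array, based on the median absolute deviation."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    abs_dev = np.abs(values - median)   # distance of each value from the median
    mad = np.median(abs_dev)            # median absolute deviation
    if mad == 0:
        # More than half of the values equal the median, so the score is undefined.
        raise ValueError("median absolute deviation is zero")
    return 0.6745 * abs_dev / mad

# Hypothetical sample: five similar values and one extreme value.
sample = np.array([10, 12, 11, 13, 12, 100])
scores = modified_z_scores(sample)
mask = scores > 3.5
print(scores)
print(mask)
```

Here the median is 12 and the MAD is 1, so the value 100 gets a score of 0.6745 * 88 / 1, far above the 3.5 threshold, while the other values stay below it.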

I tried the following code:

DF1['obs_1_outlier'] = is_outlier(DF1.obs_1.to_numpy())

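I don't get any errors, but every result is False, and I suspect that something in the function is not being computed correctly.

I have a feeling the problem is in how I pass the column to the function, but I can't put my finger on it.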
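
One way to narrow this down is to inspect what is actually passed in and the intermediate values the score depends on. The sketch below is only a diagnostic under stated assumptions: it uses the first twelve `obs_1` values from the table as a stand-in for the full column, names the frame `df1` for illustration, and recomputes the 1-D score inline instead of calling `is_outlier`. It does not establish why the original call returned all False.

```python
import numpy as np
import pandas as pd

# Assumption: the first twelve obs_1 values from the question, not the full column.
df1 = pd.DataFrame({"obs_1": [31, 16, 38, 12, 45, 14, 83, 43, 43, 27, 32, 625]})

col = df1["obs_1"].to_numpy()
print(col.shape, col.dtype)          # confirm a 1-D numeric array is being passed

median = np.median(col)
abs_dev = np.abs(col - median)       # 1-D case: distance is the absolute difference
mad = np.median(abs_dev)
scores = 0.6745 * abs_dev / mad
print(median, mad, scores.max())     # intermediate values worth checking

df1["obs_1_outlier"] = scores > 3.5
print(df1[df1["obs_1_outlier"]])     # rows flagged at the default threshold
```

With these twelve values the median is 35 and the MAD is 9, so 83 and 625 exceed the 3.5 threshold. If the same check on the real column shows a non-numeric dtype, an unexpected shape, or a very large MAD, that would point at the input instead of the formula.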

0 answers:

No answers yet.