Cluster-based training for anomaly detection

Time: 2021-07-14 09:55:31

Tags: python anomaly-detection data-preprocessing isolation-forest

After clustering my data by similar behaviour, I am now trying to detect anomalies within each cluster. The data is a list of pandas.DataFrame objects, like this:

In [0]: ip_series
Out [0]: [                     rolling_mean
          rt                              
          2021-01-13 12:00:00      0.000000
          2021-01-13 17:00:00      0.005034
          2021-01-13 18:00:00      0.003356
          2021-01-14 00:00:00      0.003523
          2021-01-14 01:00:00      0.010067
          ...                           ...
          2021-01-31 07:00:00      0.430872
          2021-01-31 08:00:00      0.444104
          2021-01-31 09:00:00      0.390856
          2021-01-31 19:00:00      0.518255
          2021-01-31 20:00:00      0.440268

           [153 rows x 1 columns],
                                rolling_mean
          rt                              
          2021-01-13 12:00:00      0.003598
          2021-01-13 17:00:00      0.003598
          2021-01-13 18:00:00      0.000000
          2021-01-14 00:00:00      0.003598
          2021-01-14 01:00:00      0.003598
          ...                           ...
          2021-01-31 07:00:00      0.773146
          2021-01-31 08:00:00      0.773917
          2021-01-31 09:00:00      0.676952
          2021-01-31 19:00:00      0.599496
          2021-01-31 20:00:00      0.528068
          [153 rows x 1 columns],
          ...]

As you can see, each DataFrame consists of a timestamp index and values (already normalized). As a preprocessing step, I reshape the data:

In [1]: train_data = ip_series.copy()
        for i in range(len(ip_series)):
            train_data[i] = train_data[i].values.reshape(len(train_data[i]))
In [2]: train_data[0]
Out [2]: array([0.        , 0.00503356, 0.0033557 , 0.00352349, 0.01006711,
         0.01979866, 0.05378715, 0.11764142, 0.14122723, 0.16423778,
         0.1906999 , 0.2042186 , 0.3008629 , 0.34443912, 0.33494727,
         0.3596836 , 0.36917546, 0.34341443, 0.40800575, 0.37260906,
         0.33277405, 0.32063758, 0.26728188, 0.26442953, 0.24161074,
         0.21221477, 0.17775647, 0.22924257, 0.22147651, 0.19932886,
         0.18098434, 0.16328859, 0.15830537, 0.2010906 , 0.17401726,
         0.17833174, 0.43127517, 0.3590604 , 0.36931927, 0.33394056,
         0.32603068, 0.33510906, 0.31353468, 0.28540268, 0.34440716,
         0.32628635, 0.33133389, 0.35725671, 0.32718121, 0.31233221,
         0.31258389, 0.31963087, 0.30629195, 0.2886745 , 0.30488974,
         0.29798658, 0.28062081, 0.33451582, 0.32387344, 0.29697987,
         0.29043624, 0.26823266, 0.37561521, 0.53758389, 0.59261745,
         0.63199105, 0.57516779, 0.58612975, 0.65486577, 0.74421141,
         0.67181208, 0.49731544, 0.52167785, 0.33704698, 0.30241611,
         0.28791946, 0.30040268, 0.2933557 , 0.3300183 , 0.36129754,
         0.40067114, 0.36563758, 0.34996949, 0.35004794, 0.42511985,
         0.38513902, 0.35134228, 0.31722595, 0.29255034, 0.19907718,
         0.29345638, 0.29888143, 0.39986577, 0.52067114, 0.43456376,
         0.43087248, 0.36362416, 0.32550336, 0.33854267, 0.32491611,
         0.28948546, 0.23713647, 0.23214765, 0.23395973, 0.23818792,
         0.25530201, 0.25328859, 0.24181208, 0.26687004, 0.23575351,
         0.2319097 , 0.29888143, 0.61937919, 0.84161074, 0.88906999,
         0.96409396, 1.        , 0.86462128, 0.76208054, 0.77491611,
         0.53833893, 0.48903803, 0.36711409, 0.3344519 , 0.31932886,
         0.3147651 , 0.3442953 , 0.34272931, 0.30825503, 0.32295302,
         0.4541387 , 0.53255034, 0.49651007, 0.55026846, 0.53496644,
         0.51982916, 0.66241611, 0.86935123, 0.84020134, 0.7876144 ,
         0.72365772, 0.69295302, 0.64383067, 0.49530201, 0.51159243,
         0.52037828, 0.50756559, 0.35349952, 0.43087248, 0.44410355,
         0.3908557 , 0.51825503, 0.44026846])
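For reference, the same flattening step on a small synthetic stand-in (the arrays and their lengths below are invented; the real frames each hold 153 rolling_mean values):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the one-column DataFrames in ip_series
rng = np.random.default_rng(0)
ip_series = [pd.DataFrame({"rolling_mean": rng.random(10)}) for _ in range(3)]

# Same flattening as the loop above, written as a comprehension:
# each (n, 1) frame becomes a 1-D array of length n.
train_data = [df["rolling_mean"].to_numpy() for df in ip_series]

print(train_data[0].shape)  # (10,)
```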

The clustering is then performed on train_data with TimeSeriesKMeans (from tslearn):

from tslearn.clustering import TimeSeriesKMeans

km = TimeSeriesKMeans(n_clusters=72, metric='dtw')
labels = km.fit_predict(train_data)

This step is crucial because the data shows a wide variety of behaviours, and my goal is to build one model per cluster and use Isolation Forest to detect anomalies in each time series within each cluster. So I build a list of DataFrames, one per cluster.

import pandas as pd

df_test = pd.DataFrame(zip(train_data, labels))
df_test.columns = ['values', 'cluster']

# transform df_test into list of dataframes sorted per cluster
cluster_df_list = []
for i in set(labels):
    df_train_iforest = df_test.loc[df_test['cluster'] == i].reset_index(drop=True)
    cluster_df_list.append(df_train_iforest)

# training
from sklearn.ensemble import IsolationForest

for i in range(len(cluster_df_list)):
    for j in range(len(cluster_df_list[i]['values'])):
        train_data_iforest = (cluster_df_list[i]['values'][j]).reshape(-1,1)
        model = IsolationForest()
        model.fit(train_data_iforest)

        cluster_df_list[i]['anomaly'] = pd.Series(model.predict(train_data_iforest))
        cluster_df_list[i]['anomaly'] = cluster_df_list[i]['anomaly'].map({1:0, -1:1})

        anomaly_cluster_df = cluster_df_list[i].loc[cluster_df_list[i]['anomaly'] == 1].reset_index(drop=True)
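To illustrate the grouping step above on toy data (the arrays and cluster labels here are invented; iterating over sorted(set(labels)) keeps the cluster order deterministic):

```python
import numpy as np
import pandas as pd

# Hypothetical flattened arrays and cluster labels
train_data = [np.arange(5, dtype=float) + k for k in range(4)]
labels = np.array([0, 1, 0, 1])

df_test = pd.DataFrame(zip(train_data, labels), columns=["values", "cluster"])

# One DataFrame per cluster, mirroring the loop above
cluster_df_list = [
    df_test.loc[df_test["cluster"] == i].reset_index(drop=True)
    for i in sorted(set(labels))
]

print(len(cluster_df_list), len(cluster_df_list[0]))  # 2 clusters, 2 arrays each
```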

What I get is entire arrays flagged as anomalies. What I actually want is "classic Isolation Forest" behaviour, i.e. detecting anomalous points within each array of a cluster. What am I doing wrong? Is my preprocessing incorrect, or do I have to feed the model differently?

TL;DR: How do I train individual models per cluster so that, instead of flagging whole arrays as anomalous within a cluster, I detect the anomalous points inside each array?
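For clarity, this is the point-wise behaviour I am after, shown on a single series (a minimal sketch on synthetic data; the contamination and random_state values are illustrative choices, not part of my pipeline):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# One synthetic 1-D series with an injected spike
rng = np.random.default_rng(42)
series = rng.normal(0.3, 0.02, size=150)
series[75] = 1.0  # obvious outlier

# Point-wise fit: each timestamp becomes one sample of shape (1,)
model = IsolationForest(contamination=0.01, random_state=42)
preds = model.fit_predict(series.reshape(-1, 1))  # 1 = inlier, -1 = outlier

print(np.where(preds == -1)[0])  # the spike's index should be among these
```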

0 Answers:

No answers yet.