Can KNeighborsClassifier compare lists of different sizes?

Asked: 2014-07-06 04:12:11

Tags: python machine-learning time-series scikit-learn knn

I have to use scikit-learn's KNeighborsClassifier to compare time series in Python, using a user-defined distance function.

knn = KNeighborsClassifier(n_neighbors=1,weights='distance',metric='pyfunc',func=dtw_dist)

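The problem is that KNeighborsClassifier does not seem to support my training data. The samples are time series, so they are lists of different sizes. When I call the fit method (knn.fit(X,Y)), KNeighborsClassifier raises this error: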

ValueError: data type not understood

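It seems that KNeighborsClassifier only supports training samples of equal size (it only accepts time series of the same length, which is not my case), but my teacher told me to use KNeighborsClassifier. So I don't know what to do...

Any ideas?

1 Answer:

Answer 0: (score: 1)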

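As far as I can tell, there are two (or one...) options:

  1. Precompute the distances (KNeighborsClassifier does not seem to support this directly, although other algorithms such as Spectral Clustering do).
  2. Make your data square using NaNs, and handle those NaNs accordingly in your custom distance function.

    "Square" your data using NaNs

    So, option 2 it is. Suppose we have the following data, where each row represents a time series: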

    import numpy as np
    
    series = [
        [1,2,3,4],
        [1,2,3],
        [1],
        [1,2,3,4,5,6,7,8]
    ]
    

    We simply make the data square by padding each row with NaNs:

    def make_square(jagged):
        # Pad every row with NaN up to the length of the longest row.
        # A new list is built per row, so the input is not mutated.
        max_cols = max(map(len, jagged))
        padded = [list(row) + [np.nan] * (max_cols - len(row)) for row in jagged]
        return np.array(padded, dtype=float)
    
    
    make_square(series)
    array([[  1.,   2.,   3.,   4.,  nan,  nan,  nan,  nan],
           [  1.,   2.,   3.,  nan,  nan,  nan,  nan,  nan],
           [  1.,  nan,  nan,  nan,  nan,  nan,  nan,  nan],
           [  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.]])
    

    Now the data "fits" into the algorithm. You just have to adapt your distance function to account for the NaNs.
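    One way to account for the padding is to strip it again inside the distance function. The sketch below assumes the real series never contain NaN themselves, and `dtw_nan` is a hypothetical stand-in for the question's `dtw_dist` (a plain dynamic time warping with absolute difference as local cost), not the implementation from any particular library. Whether `fit` accepts NaN entries in X depends on the scikit-learn version, so treat this as a sketch of the distance function only.

```python
import numpy as np


def strip_padding(row):
    """Drop NaN entries (assumes NaN only ever appears as padding)."""
    row = np.asarray(row, dtype=float)
    return row[~np.isnan(row)]


def dtw_nan(row1, row2):
    """Plain O(n*m) dynamic time warping on the unpadded series.

    Hypothetical stand-in for the question's dtw_dist: the local cost is
    |a - b| and the result is the accumulated cost of the best warping path.
    """
    a, b = strip_padding(row1), strip_padding(row2)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j],      # repeat b[j-1]
                                    cost[i, j - 1],      # repeat a[i-1]
                                    cost[i - 1, j - 1])  # advance both
    return cost[n, m]


nan = np.nan
padded = np.array([
    [1, 2, 3, 4, nan, nan, nan, nan],
    [1, 2, 3, nan, nan, nan, nan, nan],
    [1, nan, nan, nan, nan, nan, nan, nan],
    [1, 2, 3, 4, 5, 6, 7, 8],
])

print(dtw_nan(padded[0], padded[1]))
print(dtw_nan(padded[0], padded[2]))
```

    Because the padding is removed before any cost is accumulated, the result depends only on the original series and not on how wide the padded matrix is.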

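    Precompute and use a lookup function

    Oh, we can also do option 1 (assuming you have N time series):

    1. Precompute the distances into an (N, N) distance matrix D.
    2. Create an (N, 1) matrix that is just the range [0, N) (i.e. the index of each series in the distance matrix).
    3. Create a distance function, wrapper, that looks up the distance in D.
    4. Use this wrapper as the distance function.

    The wrapper function: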

      def wrapper(row1, row2):
          # Each "row" holds only the index of a series. sklearn converts the
          # integer indices to floats, so cast back to int before indexing D.
          i1, i2 = int(row1[0]), int(row2[0])
          return D[i1, i2]
      

      OK, I hope that's clear.

      Full example

      #!/usr/bin/env python2.7
      # encoding: utf-8
      '''
      '''
      from mlpy import dtw_std  # I don't know whether this is the DTW you use; any implementation will do.
      from sklearn.neighbors import KNeighborsClassifier
      import numpy as np
      
      # Example data
      series = [
          [1, 2, 3, 4],
          [1, 2, 3, 4],
          [1, 2, 3, 4],
          [1, 2, 3],
      
          [1],
      
          [1, 2, 3, 4, 5, 6, 7, 8],
          [1, 2, 5, 6, 7, 8],
          [1, 2, 4, 5, 6, 7, 8],
      ]
      
      # Class labels: I don't know the real ones, but these groupings seemed to make sense to me
      y = np.array([
          0,
          0,
          0,
          0,
      
          1,
      
          2,
          2,
          2
      ])
      
      # Compute the distance matrix
      N = len(series)
      D = np.zeros((N, N))
      
      for i in range(N):
          for j in range(i+1, N):
              D[i, j] = dtw_std(series[i], series[j])
              D[j, i] = D[i, j]
      
      print D
      
      # Create the fake data matrix: just the indices of the timeseries
      X = np.arange(N).reshape((N, 1))
      
      
      # Create the wrapper function that returns the correct distance
      def wrapper(row1, row2):
          # cast to int to prevent warnings: sklearn converts our integer indices to floats.
          i1, i2 = int(row1[0]), int(row2[0])
          return D[i1, i2]
      
      # Only the ball_tree algorithm seems to accept a custom function
      knn = KNeighborsClassifier(weights='distance', algorithm='ball_tree', metric='pyfunc', func=wrapper)
      knn.fit(X, y)
      print knn.kneighbors(X[0])
      # (array([[ 0.,  0.,  0.,  1.,  6.]]), array([[1, 2, 0, 3, 4]]))
      
      print knn.predict(X)
      # [0 0 0 0 1 2 2 2]