如何使用KNN / K-means在数据帧中聚类时间序列

时间:2017-07-06 09:33:51

标签: python time-series cluster-analysis

假设一个包含1000行的数据帧。每行代表一个时间序列。

然后我构建了一个DTW算法来计算2行之间的距离。

我不知道接下来要做什么来为数据帧完成无监督的分类任务。

如何标记数据框的所有行?

2 个答案:

答案 0 :(得分:7)

解释

  

KNN算法 = K-最近邻分类算法

     

K-means =基于质心的聚类算法

     

DTW =动态时间扭曲时间序列的相似性度量算法

我将逐步介绍如何构建两个时间序列以及如何计算动态时间扭曲(DTW)算法。您可以使用scikit-learn构建无监督的k-means聚类,而无需指定质心数,然后scikit-learn知道使用名为auto的算法。

构建时间序列并计算DTW

您有两个时间序列,您可以计算DTW

enter image description here enter image description here

import pandas as pd
import numpy as np
import random
from dtw import dtw
from matplotlib.pyplot import plot
from matplotlib.pyplot import imshow
from matplotlib.pyplot import cm

from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer 
#About classification, read the tutorial
#http://scikit-learn.org/stable/tutorial/basic/tutorial.html


def createTs(myStart, myLength):
    index = pd.date_range(myStart, periods=myLength, freq='H'); 
    values= [random.random() for _ in range(myLength)];
    series = pd.Series(values, index=index);  
    return(series)


#Time series of length 30, start from 1/1/2000 & 1/2/2000 so overlap
myStart='1/1/2000'
myLength=30
timeS1=createTs(myStart, myLength)
myStart='1/2/2000'
timeS2=createTs(myStart, myLength) 

#This could be your dataframe but unnecessary here
#myDF = pd.DataFrame([x for x in timeS1.data], [x for x in timeS2.data])#, columns=['data1', 'data2'])

x=[xxx*100 for xxx in sorted(timeS1.data)]
y=[xx for xx in timeS2.data]

choice="dtw"

if (choice="timeseries"):
    print(timeS1)
    print(timeS2)
if (choice=="drawingPlots"):
    plot(x)
    plot(y)
if (choice=="dtw"):
    #DTW with the 1st order norm
    myDiff=[xx-yy for xx,yy in zip(x,y)]
    dist, cost, acc, path = dtw(x, y, dist=lambda x, y: np.linalg.norm(myDiff, ord=1))
    imshow(acc.T, origin='lower', cmap=cm.gray, interpolation='nearest')
    plot(path[0], path[1], 'w')

KNN时间序列的分类

关于应该贴上什么标签以及使用哪些标签的问题并不明显?所以请提供以下详细信息

  • 我们应该在数据框中标注什么? DTW算法计算的路径?
  • 哪种类型的标签?二进制?多类?

之后我们可以决定我们的分类算法,也就是所谓的KNN算法。它的工作原理是你有两个独立的数据集:训练集和测试集。通过训练集,您可以教算法标记时间序列,而测试集是一个工具,通过该工具我们可以测量模型与模型选择工具(如AUC)的工作情况。

小谜题保持开放,直到提供有关问题的详细信息

#PUZZLE
#from tutorial (#http://scikit-learn.org/stable/tutorial/basic/tutorial.html)
newX = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
newY = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
newY = MultiLabelBinarizer().fit_transform(newY)
#Continue to the article.

关于分类器的Scikit-learn比较文章在下面的第二个枚举项中提供。

使用K-means进行聚类(与KNN不同)

K-means是聚类算法,你可以使用它的无监督版本

#Unsupervised version "auto" of the KMeans as no assignment for the n_clusters
myClusters=KMeans(path)
#myClusters.fit(YourDataHere)

这是与KNN算法非常不同的算法:这里我们不需要任何标签。我在第一个枚举项目中为您提供了有关此主题的更多材料。

进一步阅读

  1. Does K-means incorporate the K-nearest-neighbour algorithm?

  2. scikit中分类器的比较了解here

答案 1 :(得分:1)

您可以使用DTW。实际上,我的一个项目遇到了同样的问题,并且我用Python编写了自己的类。

这是逻辑;

  1. 创建所有群集组合。 k是簇数,n是系列数。返回的项目数应为n! / k! / (n-k)!。这些就像潜在的中心。
  2. 对于每个系列,计算每个聚类组中每个中心的距离,并将其分配给最小的聚类。
  3. 对于每个群集组,计算单个群集内的总距离。
  4. 选择最小值。

还有代码;

import numpy as np
import pandas as pd
from itertools import combinations
import time

def dtw_distance(x, y, d=lambda x,y: abs(x-y), scaled=False, fill=True):
    """Finds the distance of two arrays by dynamic time warping method
    source: https://en.wikipedia.org/wiki/Dynamic_time_warping

    Dependencies:
        import numpy as np
    Args:
        x, y: arrays
        d: distance function, default is absolute difference
        scaled: boolean, should arrays be scaled before calculation
        fill: boolean, should NA values be filled with 0
    returns:
        distance as float, 0.0 means series are exactly same, upper limit is infinite
    """
    if fill:
        x = np.nan_to_num(x)
        y = np.nan_to_num(y)
    if scaled:
        x = array_scaler(x)
        y = array_scaler(y)
    n = len(x) + 1
    m = len(y) + 1
    DTW = np.zeros((n, m))
    DTW[:, 0] = float('Inf')
    DTW[0, :] = float('Inf')
    DTW[0, 0] = 0

    for i in range(1, n):
        for j in range(1, m):
            cost = d(x[i-1], y[j-1])
            DTW[i, j] = cost + min(DTW[i-1, j], DTW[i, j-1], DTW[i-1, j-1])

    return DTW[n-1, m-1]

def array_scaler(x):
    """Scales array to 0-1

    Dependencies:
        import numpy as np
    Args:
        x: mutable iterable array of float
    returns:
        scaled x
    """
    arr_min = min(x)
    x = np.array(x) - float(arr_min)
    arr_max = max(x)
    x = x/float(arr_max)
    return x


class TrendCluster():
    def __init__(self):
        self.clusters = None
        self.centers = None
        self.scale = None

    def fit(self, series, n=2, scale=True):
        '''
        Work-flow
        1 - make series combination with size n, initial clusters
        2 - assign closest series to each cluster
        3 - calculate total distance for each combinations
        4 - choose the minimum

        Args:
            series: dict, keys can be anything, values are time series as list, assumes no nulls
            n: int, cluster size
            scale: bool, if scale needed
        '''
        assert isinstance(series, dict) and isinstance(n, int) and isinstance(scale, bool), 'wrong argument type'
        assert n < len(series.keys()), 'n is too big'
        assert len(set([len(s) for s in series.values()])) == 1, 'series length not same'

        self.scale = scale

        combs = combinations(series.keys(), n)
        combs = [[c, -1] for c in combs]

        series_keys = pd.Series(series.keys())
        dtw_matrix = pd.DataFrame(series_keys.apply(lambda x: series_keys.apply(lambda y: dtw_distance(series[x], series[y], scaled=scale))))
        dtw_matrix.columns, dtw_matrix.index = series_keys, series_keys
        for c in combs:
            c[1] = dtw_matrix.loc[c[0], :].min(axis=0).sum()

        combs.sort(key=lambda x: x[1])
        self.centers = {c:series[c] for c in combs[0][0]}
        self.clusters = {c:[] for c in self.centers.keys()}

        for k, _ in series.items():
            tmp = [[c, dtw_matrix.loc[k, c]] for c in self.centers.keys()]
            tmp.sort(key=lambda x: x[1])
            cluster = tmp[0][0]
            self.clusters[cluster].append(k)

        return None

    def assign(self, serie, save=False):
        '''
        Assigns the serie to appropriate cluster

        Args:
            serie, dict: 1 element dict
            save, bool: if new serie is stored to clusters

        Return:
            str, assigned cluster key
        '''
        assert isinstance(serie, dict) and isinstance(save, bool), 'wrong argument type'
        assert len(serie) == 1, 'serie\'s length is not exactly 1'

        tmp = [[c, dtw_distance(serie.values()[0], self.centers[c], scaled=self.scale)] for c in self.centers.keys()]
        tmp.sort(key=lambda x: x[1])
        cluster = tmp[0][0]
        if save:
            self.clusters[cluster].append(serie.keys()[0])

        return cluster

如果您希望在实际操作中看到它,可以参考我的存储库中的Time Series Clustering