K-Means GridSearchCV hyperparameter tuning

Time: 2020-05-25 12:22:22

Tags: python-3.x scikit-learn k-means grid-search gridsearchcv

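I am trying to perform hyperparameter tuning for spatio-temporal K-Means clustering by using it in a pipeline with a decision tree classifier. The idea is to use the K-Means clustering algorithm to generate a cluster-distance space matrix and cluster labels, which are then passed to the decision tree classifier. For hyperparameter tuning, only the parameters of the K-Means algorithm are tuned.

I am using Python 3.8 and sklearn 0.22.

The data I am interested in has 3 columns/attributes: "time", "x" and "y" (x and y are spatial coordinates).

The code is: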

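import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import check_array


class ST_KMeans(BaseEstimator, TransformerMixin):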
    """
    Note that K-means clustering algorithm is designed for Euclidean distances.
    It may stop converging with other distances, when the mean is no longer a
    best estimation for the cluster 'center'.

    The 'mean' minimizes squared differences (or, squared Euclidean distance).
    If you want a different distance function, you need to replace the mean with
    an appropriate center estimation.


    Parameters:

    k:  number of clusters

    eps1 : float, default=0.5
        The spatial density threshold (maximum spatial distance) between 
        two points to be considered related.

    eps2 : float, default=10
        The temporal threshold (maximum temporal distance) between two 
        points to be considered related.

    metric : string default='euclidean'
        The used distance metric - more options are
        ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’,
        ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’,
        ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘rogerstanimoto’, ‘sqeuclidean’,
        ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘yule’.

    n_jobs : int or None, default=1
        The number of processes to start; -1 means use all processors (BE AWARE)


    Attributes:

    labels : array, shape = [n_samples]
        Cluster labels for the data - noise is defined as -1
    """

    def __init__(self, k, eps1 = 0.5, eps2 = 10, metric = 'euclidean', n_jobs = 1):
        self.k = k
        self.eps1 = eps1
        self.eps2 = eps2
        # self.min_samples = min_samples
        self.metric = metric
        self.n_jobs = n_jobs


    def fit(self, X, Y = None):
        """
        Apply the ST K-Means algorithm 

        X : 2D numpy array. The first attribute of the array should be time attribute
            as float. The following positions in the array are treated as spatial
            coordinates.
            The structure should look like this [[time_step1, x, y], [time_step2, x, y]..]

            For example 2D dataset:
            array([[0,0.45,0.43],
            [0,0.54,0.34],...])


        Returns:

        self
        """

        # check if input is correct
        X = check_array(X)

        # type(X)
        # numpy.ndarray

        # Check the threshold arguments-
        if not self.eps1 > 0.0 or not self.eps2 > 0.0:
            raise ValueError('eps1 and eps2 must be positive')

        # Get dimensions of 'X'-
        # n - number of rows
        # m - number of attributes/columns-
        n, m = X.shape


        # Compute square-form distance matrices for the 'time' and spatial attributes-
        time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric = self.metric))
        euc_dist = squareform(pdist(X[:, 1:], metric = self.metric))

        '''
        Filter the euclidean distance matrix using time distance matrix. The code snippet gets all the
        indices of the 'time_dist' matrix in which the time distance is smaller than 'eps2'.
        Afterward, for the same indices in the euclidean distance matrix the 'eps1' is doubled which results
        in the fact that the indices are not considered during clustering - as they are bigger than 'eps1'.
        '''
        # filter 'euc_dist' matrix using 'time_dist' matrix-
        dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)


        # Initialize K-Means clustering model-
        self.kmeans_clust_model = KMeans(
            n_clusters = self.k, init = 'k-means++',
            n_init = 10, max_iter = 300,
            precompute_distances = 'auto', algorithm = 'auto')

        # Train model-
        self.kmeans_clust_model.fit(dist)


        self.labels = self.kmeans_clust_model.labels_
        self.X_transformed = self.kmeans_clust_model.fit_transform(X)

        return self


    def transform(self, X):
        if not isinstance(X, np.ndarray):
            # Convert to numpy array-
            X = X.values

        # Get dimensions of 'X'-
        # n - number of rows
        # m - number of attributes/columns-
        n, m = X.shape


        # Compute square-form distance matrices for the 'time' and spatial attributes-
        time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric = self.metric))
        euc_dist = squareform(pdist(X[:, 1:], metric = self.metric))

        # filter 'euc_dist' matrix using 'time_dist' matrix-
        dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)

        # return self.kmeans_clust_model.transform(X)
        return self.kmeans_clust_model.transform(dist)


# Initialize ST-K-Means object-
st_kmeans_algo = ST_KMeans(
    k = 5, eps1=0.6,
    eps2=9, metric='euclidean',
    n_jobs=1
    )

Y = np.zeros(shape = (501,))

# Train on a chunk of dataset-
st_kmeans_algo.fit(data.loc[:500, ['time', 'x', 'y']], Y)

# Get clustered data points labels-
kmeans_labels = st_kmeans_algo.labels

kmeans_labels.shape
# (501,)


# Get labels for points clustered using trained model-
# kmeans_transformed = st_kmeans_algo.X_transformed
kmeans_transformed = st_kmeans_algo.transform(data.loc[:500, ['time', 'x', 'y']])

kmeans_transformed.shape
# (501, 5)

dtc = DecisionTreeClassifier()

dtc.fit(kmeans_transformed, kmeans_labels)

y_pred = dtc.predict(kmeans_transformed)

# Get model performance metrics-
accuracy = accuracy_score(kmeans_labels, y_pred)
precision = precision_score(kmeans_labels, y_pred, average='macro')
recall = recall_score(kmeans_labels, y_pred, average='macro')

print("\nDT model metrics are:")
print("accuracy = {0:.4f}, precision = {1:.4f} & recall = {2:.4f}\n".format(
    accuracy, precision, recall
    ))

# DT model metrics are:
# accuracy = 1.0000, precision = 1.0000 & recall = 1.0000




# Hyper-parameter Tuning:

# Define steps of pipeline-
pipeline_steps = [
    ('st_kmeans_algo' ,ST_KMeans(k = 5, eps1=0.6, eps2=9, metric='euclidean', n_jobs=1)),
    ('dtc', DecisionTreeClassifier())
    ]

# Instantiate a pipeline-
pipeline = Pipeline(pipeline_steps)

kmeans_transformed.shape, kmeans_labels.shape
# ((501, 5), (501,))

# Train pipeline-
pipeline.fit(kmeans_transformed, kmeans_labels)




# Specify parameters to be hyper-parameter tuned-
params = [
    {
        'st_kmeans_algo__k': [3, 5, 7]
    }
    ]

# Initialize GridSearchCV object-
grid_cv = GridSearchCV(estimator=pipeline, param_grid=params, cv = 2)

# Train GridSearch on computed data from above-
grid_cv.fit(kmeans_transformed, kmeans_labels)

The "grid_cv.fit()" call raises the following error (for each frame, only the line marked by the arrow is shown):

ValueError                                Traceback (most recent call last)
      6 # Train GridSearch on computed data from above-
----> 7 grid_cv.fit(kmeans_transformed, kmeans_labels)

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
--> 710             self._run_search(evaluate_candidates)

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
-> 1151         evaluate_candidates(ParameterGrid(self.param_grid))

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params)
--> 682                 out = parallel(delayed(_fit_and_score)(clone(base_estimator),

~/.local/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
-> 1004             if self.dispatch_one_batch(iterator):

~/.local/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
--> 835                 self._dispatch(tasks)

~/.local/lib/python3.8/site-packages/joblib/parallel.py in _dispatch(self, batch)
--> 754             job = self._backend.apply_async(batch, callback=cb)

~/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
--> 209         result = ImmediateResult(func)

~/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
--> 590         self.results = batch()

~/.local/lib/python3.8/site-packages/joblib/parallel.py in __call__(self)
--> 255             return [func(*args, **kwargs)

~/.local/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0)
--> 255             return [func(*args, **kwargs)

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
--> 544         test_scores = _score(estimator, X_test, y_test, scorer)

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer)
--> 591         scores = scorer(estimator, X_test, y_test)

~/.local/lib/python3.8/site-packages/sklearn/metrics/_scorer.py in __call__(self, estimator, *args, **kwargs)
---> 89                 score = scorer(estimator, *args, **kwargs)

~/.local/lib/python3.8/site-packages/sklearn/metrics/_scorer.py in _passthrough_scorer(estimator, *args, **kwargs)
--> 371     return estimator.score(*args, **kwargs)

~/.local/lib/python3.8/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)

~/.local/lib/python3.8/site-packages/sklearn/pipeline.py in score(self, X, y, sample_weight)
--> 619         return self.steps[-1][-1].score(Xt, y, **score_params)

~/.local/lib/python3.8/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
--> 369         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)

~/.local/lib/python3.8/site-packages/sklearn/metrics/_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
--> 185     y_type, y_true, y_pred = _check_targets(y_true, y_pred)

~/.local/lib/python3.8/site-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
---> 80     check_consistent_length(y_true, y_pred)

~/.local/lib/python3.8/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
--> 211         raise ValueError("Found input variables with inconsistent numbers of"
    212                          " samples: %r" % [int(l) for l in lengths])

ValueError: Found input variables with inconsistent numbers of samples: [251, 250]

The different dimensions/shapes are:

kmeans_transformed.shape, kmeans_labels.shape, data.loc[:500, ['time', 'x', 'y']].shape                                       
# ((501, 5), (501,), (501, 3))

I don't understand where the "samples: [251, 250]" in the error comes from.

What is going wrong?

Thanks!

1 answer:

Answer 0 (score: 1)

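250 and 251 are the sizes of the training and validation splits, respectively, inside GridSearchCV: with cv = 2, the 501 rows are divided into one fold of 251 rows and one of 250.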
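As a quick check of where those two numbers come from, the sketch below splits 501 placeholder rows into two folds and records the sizes. It uses plain `KFold` as a simplification; for a classifier, GridSearchCV uses a stratified splitter by default, so the exact membership of each fold would differ, but the fold sizes for 501 rows still differ by one.

```python
import numpy as np
from sklearn.model_selection import KFold

# 501 placeholder rows, matching the row count in the question
X = np.zeros((501, 3))

# Record (train size, test size) for each of the two folds
split_sizes = [
    (len(train_idx), len(test_idx))
    for train_idx, test_idx in KFold(n_splits=2).split(X)
]
print(split_sizes)  # [(250, 251), (251, 250)]
```

In the first split the model is fitted on 250 rows and scored on 251, which matches the pair of lengths in the error message.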

Look at your custom estimator...

def transform(self, X):

    return self.X_transformed
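The effect of a transform like this can be reproduced in isolation. The sketch below is a toy stand-in, not the code from the question: `CachedKMeans`, the random data and the split sizes are all assumptions chosen only to mirror the 250/251 situation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans


class CachedKMeans(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: transform() ignores its argument."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y=None):
        self.model_ = KMeans(n_clusters=self.k, n_init=10, random_state=0)
        # Cache the cluster-distance matrix of the *training* rows
        self.X_transformed = self.model_.fit_transform(X)
        return self

    def transform(self, X):
        # Bug being illustrated: X is never used
        return self.X_transformed


rng = np.random.default_rng(0)
X_train = rng.random((250, 3))  # same size as the training split
X_valid = rng.random((251, 3))  # same size as the validation split

est = CachedKMeans(k=3).fit(X_train)
out = est.transform(X_valid)
print(out.shape)  # (250, 3), although 251 rows were passed in
```

A classifier placed after such a step would predict 250 values for 251 validation labels, which is the length mismatch that `check_consistent_length` reports.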

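The original transform method does not apply any kind of operation; it simply returns the transformed training data. We need an estimator that can transform new data as well (in our case, the validation data inside the grid search). Change the transform method this way: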

def transform(self, X):

    return self.kmeans_clust_model.transform(X)
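A minimal end-to-end sketch of that change is given below, under several assumptions: `KMeansDistances` is a hypothetical, simplified transformer that runs plain K-Means on the raw columns and leaves out the eps1/eps2 time filtering (a pairwise distance matrix has one column per input row, so it cannot be reused on a split of a different size); the data is random; and the targets are placeholder labels from a one-off K-Means run. The raw features, not pre-transformed ones, are passed to the grid search.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


class KMeansDistances(BaseEstimator, TransformerMixin):
    """Hypothetical simplified transformer: plain K-Means on the raw columns."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y=None):
        self.kmeans_clust_model = KMeans(
            n_clusters=self.k, n_init=10, random_state=0
        )
        self.kmeans_clust_model.fit(X)
        return self

    def transform(self, X):
        # Distances from each row of X to every cluster centre
        return self.kmeans_clust_model.transform(X)


# Synthetic stand-in for the [time, x, y] data (assumption)
rng = np.random.default_rng(0)
X = rng.random((501, 3))
# Placeholder targets: labels from a one-off K-Means run
y = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

pipeline = Pipeline([
    ("st_kmeans_algo", KMeansDistances(k=5)),
    ("dtc", DecisionTreeClassifier(random_state=0)),
])

grid_cv = GridSearchCV(
    pipeline, param_grid={"st_kmeans_algo__k": [3, 5, 7]}, cv=2
)
grid_cv.fit(X, y)  # raw features go in; the pipeline applies the transform
print(grid_cv.best_params_)
```

Because transform now operates on whatever rows it receives, the number of predictions matches the number of validation labels in every fold.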