Question

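I'm clustering a sample of about 100 (unlabelled) records and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring with silhouette_score, which works fine.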

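My problem is that I don't need the cross-validation aspect of GridSearchCV / RandomizedSearchCV, but I can't find a simple GridSearch / RandomizedSearch. I could write my own, but the ParameterSampler and ParameterGrid objects are very useful.

My next step would be to subclass BaseSearchCV and implement my own _fit() method, but I thought it was worth asking whether there is a simpler way to do this, for example by passing something to the cv parameter?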

Answer 1

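The clusteval library can help you evaluate the data and find the optimal number of clusters. It provides five methods for evaluating clusterings: silhouette, dbindex, derivative, dbscan and hdbscan.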

pip install clusteval

Which evaluation method to use depends on your data.

# Import library
from clusteval import clusteval

# Set parameters, as an example dbscan
ce = clusteval(method='dbscan')

# Fit to find optimal number of clusters using dbscan
results = ce.fit(X)

# Make plot of the cluster evaluation
ce.plot()

# Make scatter plot. Note that the first two coordinates are used for plotting.
ce.scatter(X)

# results is a dict with various output statistics. One of them is the cluster labels.
cluster_labels = results['labx']

Answer 2

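I ran into a similar problem recently. I defined a custom iterable, cv_custom, which specifies the splitting strategy and is passed as the cross-validation parameter cv. The iterable should contain one pair per fold, with the samples identified by their indices, e.g. ([fold1_train_ids], [fold1_test_ids]), ([fold2_train_ids], [fold2_test_ids]), ... In our case we need only one pair for a single fold, with the indices of all examples in both the train part and the test part: ([train_ids], [test_ids])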

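import numpy as np
from sklearn.model_selection import cross_val_score
N = len(distance_matrix)
# a single "fold" whose train and test parts both cover every sample
cv_custom = [(np.arange(N), np.arange(N))]
scores = cross_val_score(clf, X, y, cv=cv_custom)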
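The same single-fold trick can be passed to GridSearchCV together with a custom scorer, which gets close to what the question asks for. The following is only a sketch under assumptions: the synthetic blobs stand in for the real records, and the names silhouette_scorer and cv_single are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV


def silhouette_scorer(estimator, X, y=None):
    """Hypothetical scorer: silhouette of the labels predicted by a fitted clusterer."""
    labels = estimator.predict(X)
    # silhouette_score needs at least two distinct labels
    if len(set(labels)) < 2:
        return -1.0
    return silhouette_score(X, labels)


# Synthetic stand-in for the ~100 unlabelled records (an assumption)
X, _ = make_blobs(
    n_samples=99,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# One "fold": train on every sample and score on every sample
n = X.shape[0]
cv_single = [(np.arange(n), np.arange(n))]

search = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": list(range(2, 8))},
    scoring=silhouette_scorer,
    cv=cv_single,
)
search.fit(X)  # no y: the data is unlabelled
print(search.best_params_, search.best_score_)
```

Because the scorer calls predict, this only works for clusterers that implement it (KMeans does; DBSCAN, for instance, does not).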

Answer 3

OK, this may be an old question, but I use code like this:

First, we want to generate all possible parameter combinations:

def make_generator(parameters):
    """Recursively yield one dict per combination of the values in `parameters`."""
    if not parameters:
        yield dict()
    else:
        key_to_iterate = list(parameters.keys())[0]
        next_round_parameters = {p : parameters[p]
                    for p in parameters if p != key_to_iterate}
        for val in parameters[key_to_iterate]:
            for pars in make_generator(next_round_parameters):
                temp_res = pars
                temp_res[key_to_iterate] = val
                yield temp_res
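For comparison, the same combinations can be produced without recursion using itertools.product. This is a sketch rather than part of the original answer, and the function name iter_param_combinations is made up for illustration.

```python
from itertools import product


def iter_param_combinations(parameters):
    """Yield one dict per combination of values, first key varying slowest."""
    keys = list(parameters)
    # product() over zero iterables yields a single empty tuple,
    # so an empty `parameters` dict produces exactly one empty dict.
    for values in product(*(parameters[k] for k in keys)):
        yield dict(zip(keys, values))


combos = list(iter_param_combinations({"a": [1, 2], "b": ["x", "y"]}))
print(combos)
```

scikit-learn's ParameterGrid, mentioned in the question, serves the same purpose.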

Then create a loop:

from sklearn.cluster import KMeans

# fixed parameters shared by every run (an arbitrary example value)
fixed_params = {"max_iter": 300}

param_grid = {"n_clusters": range(2, 11)}

for params in make_generator(param_grid):
    params.update(fixed_params)
    ca = KMeans(**params)
    ca.fit(_data)  # _data: the feature matrix being clustered
    labels = ca.labels_
    # Evaluate the clustering labels here (e.g. with silhouette_score)
    # and decide whether to keep or discard this parameter set.

Of course, this can be wrapped up in a nice function; the solution above is mainly an example.
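One possible way to wrap it up, as a sketch: it uses scikit-learn's ParameterGrid in place of the hand-written generator and silhouette_score as the selection criterion. The function name grid_search_clustering and the synthetic data are assumptions for illustration only.

```python
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid


def grid_search_clustering(estimator, param_grid, X):
    """Fit one clusterer per combination; return (best_params, best_score, results)."""
    results = []
    for params in ParameterGrid(param_grid):
        model = clone(estimator).set_params(**params)
        labels = model.fit_predict(X)
        # silhouette_score is only defined for 2 <= n_labels <= n_samples - 1
        n_labels = len(set(labels))
        if 2 <= n_labels <= len(X) - 1:
            score = silhouette_score(X, labels)
        else:
            score = float("-inf")
        results.append((params, score))
    best_params, best_score = max(results, key=lambda r: r[1])
    return best_params, best_score, results


# Synthetic stand-in for the real data (an assumption)
X, _ = make_blobs(
    n_samples=99,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

best_params, best_score, results = grid_search_clustering(
    KMeans(max_iter=300, n_init=10, random_state=0),
    {"n_clusters": list(range(2, 11))},
    X,
)
print(best_params, best_score)
```

Swapping ParameterGrid for ParameterSampler (with n_iter and random_state) would turn this into a randomized search without changing the rest of the loop.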

Hope it helps someone!

Grid search for hyperparameter evaluation of clustering in scikit-learn

3 answers: