Question

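I have trained and tested a KNN model in Python on a small supervised dataset of about 200 samples. I would like to apply these results to a much larger unsupervised dataset of several thousand samples.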

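My question is: is there a way to fit the KNN model on the small supervised dataset and then change the K value for the large unsupervised dataset? I don't want to overfit the model by using the low K value from the smaller dataset, but I'm not sure how to fit the model and then change K in Python.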

Is this possible with KNN? Are there other ways to apply KNN to a much larger unsupervised dataset?

Answer 1

Short answer: if you set up a KNN classifier with a given value of k, you can't later ask it to make predictions with a different k.

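That said, I don't think that is what you need to do here. If you split the supervised dataset into cross-validation folds (see the scikit-learn docs), you can try different values of k, pick the one that performs best for your final classifier, and use that classifier to make predictions on the larger dataset.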
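
As a rough sketch of the cross-validation idea above: the data here is synthetic (generated with make_classification as a stand-in for the 200 labeled samples and the larger unlabeled set), and the candidate k values and the 5-fold split are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins (assumptions): 200 labeled samples plus 3000 samples
# whose labels we pretend not to know.
X_all, y_all = make_classification(
    n_samples=3200, n_features=8, n_informative=5, random_state=0
)
X_small, y_small = X_all[:200], y_all[:200]  # small supervised set
X_large = X_all[200:]                        # larger set, labels unused

# Evaluate several k values with 5-fold cross-validation on the labeled data only.
candidate_ks = [1, 3, 5, 7, 9, 11, 15, 21]
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": candidate_ks},
    cv=5,
    scoring="accuracy",
)
search.fit(X_small, y_small)

best_k = search.best_params_["n_neighbors"]
print("best k:", best_k, "mean CV accuracy:", round(search.best_score_, 3))

# With the default refit=True, best_estimator_ has been refit on all labeled
# samples using the chosen k, so it can predict on the larger set directly.
predictions = search.best_estimator_.predict(X_large)
print(predictions[:10])
```

Because k is picked by held-out accuracy rather than by hand, a very low k that merely memorizes the small set should score worse across folds and is less likely to be selected.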

Answer 2

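In machine learning there are two types of learners: eager learners (decision trees, neural networks, SVMs, ...) and lazy learners such as KNN. In fact, KNN doesn't do any learning at all. It simply stores the labeled data you have and uses it at inference time, computing how similar a new (unlabeled) sample is to every sample in the stored (labeled) data. It then infers the new sample's class/value from a majority vote among its K nearest instances (the K nearest neighbors, hence the name).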
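
To make the "store, then vote" description concrete, here is a minimal from-scratch sketch in NumPy. The class name LazyKNN, the Euclidean distance, and the toy data are all assumptions for illustration; this is not how scikit-learn implements it internally.

```python
import numpy as np

class LazyKNN:
    """Minimal k-nearest-neighbors classifier: 'fitting' only stores the data."""

    def fit(self, X, y):
        # No parameters are learned; the labeled samples are simply kept.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X_new, k):
        X_new = np.asarray(X_new, dtype=float)
        # Euclidean distance from every new sample to every stored sample.
        dists = np.linalg.norm(X_new[:, None, :] - self.X_[None, :, :], axis=2)
        # Indices of the k closest stored samples for each new sample.
        nearest = np.argsort(dists, axis=1)[:, :k]
        # Majority vote among the labels of those neighbors.
        preds = []
        for neighbor_labels in self.y_[nearest]:
            labels, counts = np.unique(neighbor_labels, return_counts=True)
            preds.append(labels[np.argmax(counts)])
        return np.array(preds)

# Toy data (assumed): two well-separated groups in 2-D.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

model = LazyKNN().fit(X_train, y_train)
result = model.predict([[0.5, 0.5], [5.5, 5.5]], k=3)
print(result)  # [0 1]
```

All of the work happens in predict(); fit() does nothing beyond keeping a copy of the data.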

Now, to answer your question: "training" a KNN has nothing to do with K, so at inference time you are free to use whichever K gives you the best results.
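
One way this could look with scikit-learn is sketched below. It relies on the kneighbors() method of KNeighborsClassifier, which takes an n_neighbors argument per call; the helper predict_with_k, the synthetic data, and the restriction to binary 0/1 labels are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumed): 200 labeled samples and 100 new samples.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_labeled, y_labeled = X[:200], y[:200]
X_new = X[200:]

# fit() stores the labeled samples and builds a search index;
# n_neighbors=5 here is only the default used by predict().
knn = KNeighborsClassifier(n_neighbors=5).fit(X_labeled, y_labeled)

def predict_with_k(model, labels, X_query, k):
    """Hypothetical helper: majority vote over the k nearest stored samples."""
    # kneighbors() accepts n_neighbors per call, so nothing is refit.
    neighbor_idx = model.kneighbors(X_query, n_neighbors=k, return_distance=False)
    votes = labels[neighbor_idx]  # shape: (n_queries, k)
    # Majority vote for binary 0/1 labels; an odd k avoids ties.
    return (votes.mean(axis=1) > 0.5).astype(int)

for k in (1, 5, 15):
    preds = predict_with_k(knn, y_labeled, X_new, k)
    print(k, preds[:10])
```

The same fitted object answers queries for k = 1, 5 and 15, which matches the point that the stored data does not depend on K.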

Answer 3

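I would suggest actually fitting the KNN model to the larger dataset several times, each time with a different value of k. For each of those models you can then compute the Silhouette Score.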

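Compare the silhouette scores and, for your final value of k (the number of clusters), choose the value used by the highest-scoring model.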

For example, here is some code I ran for myself last year:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import mixture
from sklearn.metrics import silhouette_score


## A list of the different numbers of clusters (the 'n_components' parameter) with 
## which we will run GMM.
number_of_clusters = list(range(2, 21))

## Graph plotting method
def makePlot(number_of_clusters, silhouette_scores):
    # Plot each value of 'number of clusters' vs. the silhouette score at that value
    fig, ax = plt.subplots(figsize=(16, 6))
    ax.set_xlabel('GMM - number of clusters')
    ax.set_ylabel('Silhouette Score (higher is better)')
    ax.plot(number_of_clusters, silhouette_scores)

    # Ticks and grid
    xticks = np.arange(min(number_of_clusters), max(number_of_clusters)+1, 1.0)
    ax.set_xticks(xticks, minor=False)
    ax.set_xticks(xticks, minor=True)
    ax.xaxis.grid(True, which='both')
    yticks = np.arange(round(min(silhouette_scores), 2), max(silhouette_scores), .02)
    ax.set_yticks(yticks, minor=False)
    ax.set_yticks(yticks, minor=True)
    ax.yaxis.grid(True, which='both')

## Graph the mean silhouette score for each number of clusters.
## Print out the number of clusters that results in the highest
## silhouette score for GMM.
def findBestClusterer(number_of_clusters, data):
    ## 'data' is your data set: an array of shape (n_samples, n_features)
    silhouette_scores = []
    for i in number_of_clusters:
        ## mixture.GMM was removed in scikit-learn 0.20; GaussianMixture replaces it.
        clusterer = mixture.GaussianMixture(n_components=i) # Use the model of your choice here
        clusterer.fit(data)
        preds = clusterer.predict(data)
        score = silhouette_score(data, preds)
        silhouette_scores.append(score)

    ## Print a table of all the silhouette scores
    print("")
    print("| Number of clusters | Silhouette score |")
    print("| ------------------ | ---------------- |")
    for number, score in zip(number_of_clusters, silhouette_scores):
        ## Fixed column widths keep the table aligned for one- and two-digit
        ## values of the number of clusters.
        print("| {number:<18} | {score:<16.4f} |".format(number=number, score=score))

    ## Graph the plot of silhouette scores for each number of clusters
    makePlot(number_of_clusters, silhouette_scores)
    plt.show() # Needed when running outside a notebook

    ## Find and print out the number of clusters that gives the highest
    ## silhouette score.
    best_silhouette_score = max(silhouette_scores)
    index_of_best_score = silhouette_scores.index(best_silhouette_score)
    ideal_number_of_clusters = number_of_clusters[index_of_best_score]
    print("")
    print("Having {} clusters gives the highest silhouette score of {}.".format(ideal_number_of_clusters,
                                                                                round(best_silhouette_score, 4)))

## Replace 'data' with the variable that holds your own data set.
findBestClusterer(number_of_clusters, data)

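Note that my example uses a GMM model rather than KNN, but you should be able to modify the findBestClusterer() function slightly to use whichever clustering algorithm you want. It is also where you supply your data set.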
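
As one possible way to make that swap easier, the sketch below takes the clustering algorithm as a factory function instead of hard-coding it. The function name best_cluster_count, the use of KMeans, and the synthetic blob data are assumptions for illustration, not part of the code above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def best_cluster_count(data, candidate_counts, make_clusterer):
    """Return (best_count, scores), ranking candidates by mean silhouette score."""
    scores = {}
    for n in candidate_counts:
        # make_clusterer(n) must return an estimator that supports fit_predict().
        labels = make_clusterer(n).fit_predict(data)
        scores[n] = silhouette_score(data, labels)
    best = max(scores, key=scores.get)
    return best, scores

# Synthetic data (assumed): four tight, well-separated blobs in 2-D.
data, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

best, scores = best_cluster_count(
    data,
    range(2, 9),
    lambda n: KMeans(n_clusters=n, n_init=10, random_state=0),
)
print("best number of clusters:", best)
for n, s in scores.items():
    print(n, round(s, 4))
```

Switching to another algorithm should only require changing the lambda, for example to one that returns mixture.GaussianMixture(n_components=n), since that estimator also provides fit_predict().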

Applying KNN from a small supervised dataset to a large unsupervised dataset in Python
