The documentation is vague about this, and I thought it would be simple to implement.
The k_mean algorithm applied to the MNIST digit dataset outputs 10 regions, each identified by an arbitrary number, and that number is not the digit represented by the majority of the samples contained in the region.
I do have my table of ground_truth labels.
How can I make each region produced by the k_mean algorithm end up labelled with the digit it most probably covers?
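To make the goal concrete, here is a minimal sketch of the kind of majority-vote mapping I am after (the helper name majority_vote_mapping is purely illustrative; cluster_labels would come from kmeans.labels_ and y_train would hold the true digits, both as NumPy arrays):

import numpy as np

def majority_vote_mapping(cluster_labels, y_train, n_clusters=10):
    mapping = {}
    for region in range(n_clusters):
        # true digits of all the samples that fell into this region
        digits_in_region = y_train[cluster_labels == region]
        # the most frequent true digit becomes this region's label
        mapping[region] = np.bincount(digits_in_region, minlength=10).argmax()
    return mapping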
I spent a few hours yesterday writing this code, but it is still incomplete:
# TODO: for centroid-average method, see https://stackoverflow.com/a/25831425/9768291
def most_probable_digit(indices, data):
    """
    Given an array of indices (for a specific label assigned by scikit, obtained with 'get_list_of_indices')
    telling us where the true labels sit in 'data', this function counts how many times each true label
    appears and returns the one that appears most often (and which therefore has the highest probability
    of being the ground_truth_label designated by the region delimited by scikit).
    :param indices: array of indices in 'data' that belong to one k_mean region
    :param data: all the data spread across the k_mean regions
    :return: the most probable value (digit) associated with this region
    """
    actual_labels = []
    for i in indices:
        actual_labels.append(data[i])
    if verbose: print("The actual labels for each of those digits are:", actual_labels)
    counts = count_labels("actual labels", actual_labels)  # 'count_labels' and 'verbose' are defined in my utils module
    probable = counts.index(max(counts))
    if verbose: print("Most probable digit:", probable)
    return probable

def get_list_of_indices(data, label):
    """
    Returns a list of indices corresponding to every position
    in 'data' where the specified 'label' can be found.
    :param data:
    :param label: the number associated with a region generated by k_mean
    :return:
    """
    return (np.where(data == label))[0].tolist()

# TODO: reassign in case of doubles
def obtain_corresponding_labels(data, real_labels):
    """
    Assign the most probable label to each region.
    :param data: list of regions associated with x_train or x_test (the order is preserved!)
    :param real_labels: actual labels to assign to the region numbers
    :return: the list of corresponding actual labels to region numbers
    """
    switches_to_make = []
    for i in range(10):
        list_of_indices = get_list_of_indices(data, i)  # indices in 'data' which are associated with region "i"
        probable_label = most_probable_digit(list_of_indices, real_labels)
        print("The assigned region", i, "should be considered as representing the digit", probable_label)
        switches_to_make.append(probable_label)
    return switches_to_make

def rearrange_labels(switches_to_make, to_change):
    """
    Takes region numbers and assigns the most probable digit (label) to each.
    For example, if switches_to_make[3] = 5, it means that the 4th region (index 3 of the list)
    should be considered as representing the digit "5".
    :param switches_to_make: list of changes to make
    :param to_change: this array will be changed according to 'switches_to_make'
    :return: nothing, the change is made in place
    """
    for region in range(len(to_change)):
        for label in range(len(switches_to_make)):
            if to_change[region] == label:  # if it corresponds to the "wrong" label given by scikit
                to_change[region] = switches_to_make[label]  # assign the "most probable" label
                break

def count_error_rate(found, truth):
    wrong = 0
    for i in range(len(found)):
        if found[i] != truth[i]:
            wrong += 1
    print("Error rate = ", wrong / len(found) * 100, "%\n\n")

def treat_data(switches_to_make, predictions, truth):
    rearrange_labels(switches_to_make, predictions)  # Rearranging the training labels
    count_error_rate(predictions, truth)  # Counting error rate
Currently, the problem with my code is that it can produce duplicates: if two regions share the same most-probable digit, that digit ends up associated with both regions.
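One idea I have been considering to avoid these duplicates (not implemented in my code above) is to force a one-to-one region-to-digit assignment with the Hungarian algorithm, via scipy.optimize.linear_sum_assignment. A rough sketch, where the helper name unique_label_mapping is purely illustrative:

import numpy as np
from scipy.optimize import linear_sum_assignment

def unique_label_mapping(cluster_labels, y_train, n_clusters=10):
    # counts[i][j] = how many samples of true digit j ended up in region i
    counts = np.zeros((n_clusters, 10), dtype=int)
    for region, digit in zip(cluster_labels, y_train):
        counts[region][digit] += 1
    # pick the region-to-digit pairing that maximizes total agreement,
    # so no digit can be assigned to two regions
    regions, digits = linear_sum_assignment(counts, maximize=True)
    return dict(zip(regions, digits))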
Here is how I use the code:
kmeans = KMeans(n_clusters=10) # TODO: eventually use "init=ndarray" to be able to use custom centroids for init ?
kmeans.fit(x_train)
training_labels = kmeans.labels_
print("Done with calculating the k-mean.\n")
switches_to_make = utils.obtain_corresponding_labels(training_labels, y_train) # Obtaining the most probable labels
utils.treat_data(switches_to_make, training_labels, y_train)
print("Assigned labels: ", training_labels)
print("Real labels: ", y_train)
print("\n####################################################\nMoving on to predictions")
predictions = kmeans.predict(x_test)
utils.treat_data(switches_to_make, predictions, y_test)
My code gets an error rate of roughly 50%.
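To understand where that 50% comes from, one diagnostic I could run (hypothetical, not part of the code above) is to look at how the true digits are spread across the 10 regions; whenever two digits share a region, the minority one is guaranteed to be misclassified by the majority-vote mapping. For example:

import numpy as np

# Recompute the region of each training sample (treat_data modifies training_labels in place)
regions = kmeans.predict(x_train)
for region in range(10):
    digits_in_region = y_train[regions == region]
    print("Region", region, "digit counts:", np.bincount(digits_in_region, minlength=10))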
Answer 0 (score: 0)
If I understand you correctly, you want to assign the actual digit value as the label of the cluster it matches, right? If so, I don't think that is possible.
K-Means is an unsupervised learning algorithm. It has no understanding of what it is looking at, and the labels it assigns are arbitrary: instead of 0, 1, 2, ... it could just as well have called them 'apple', 'orange', 'grape', .... All K-Means can tell you is that, according to some metric, a bunch of data points are similar to one another, and that's it. It is great for data exploration or pattern finding, but not for telling you what something actually is.
No matter what post-processing you apply, the computer will never programmatically know what the true label is unless you, a human, tell it. In that case you might as well use a supervised learning algorithm.
If you want to train a model that assigns the correct label when it sees a digit, you have to use a supervised learning method (where labels are a thing). Take a look at Random Forest, for example. Here is a similar attempt.
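As a rough illustration of that supervised route, a minimal sketch using scikit-learn's RandomForestClassifier (reusing the x_train, y_train, x_test and y_test arrays from your question) could look like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train a supervised classifier on the labelled training digits
clf = RandomForestClassifier(n_estimators=100)
clf.fit(x_train, y_train)

# Predict the actual digit labels for the test set and compare against the ground truth
test_predictions = clf.predict(x_test)
print("Accuracy:", accuracy_score(y_test, test_predictions))

Because the true labels are used during fitting, the model's outputs are actual digits rather than arbitrary cluster numbers.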
Answer 1 (score: 0)
Here is the code that uses my solution:
from sklearn.cluster import KMeans
import utils
# Loading the dataset
x_train, y_train = utils.get_train_data()
x_test, y_test = utils.get_test_data()
kmeans = KMeans(n_clusters=10)
kmeans.fit(x_train)
training_labels = kmeans.labels_
switches_to_make = utils.find_closest_digit_to_centroids(kmeans, x_train, y_train) # Obtaining the most probable labels (digits) for each region
utils.treat_data(switches_to_make, training_labels, y_train)
predictions = kmeans.predict(x_test)
utils.treat_data(switches_to_make, predictions, y_test)
And here is utils.py:
import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances_argmin_min
use_reduced = True # Flag variable to use the reduced datasets (generated by 'pre_process.py')
verbose = False # Should debugging prints be shown
def get_data(reduced_path, path):
    """
    Loads the desired dataset.
    :param reduced_path: path to the reduced version (generated by 'pre_process.py')
    :param path: path to the full version
    :return: numpy arrays (data, labels)
    """
    if use_reduced:
        data = open(reduced_path)
    else:
        data = open(path)
    csv_file = csv.reader(data)
    data_points = []
    for row in csv_file:
        data_points.append(row)
    data_points.pop(0)  # remove the first line, i.e. the column headers
    data.close()

    # Convert from String to int
    for i in range(len(data_points)):  # for each image
        for j in range(len(data_points[0])):  # for each pixel
            data_points[i][j] = int(data_points[i][j])
            # # To obtain FLOAT values normalized between 0 and 1:
            # data_points[i][j] = np.divide(float(data_points[i][j]), 255)

    # Separate the labels from the data
    y_train = []  # labels
    for row in data_points:
        y_train.append(row[0])  # first column is the label
    x_train = []  # data
    for row in data_points:
        x_train.append(row[1:785])  # other columns are the pixels

    x_train = np.array(x_train)
    y_train = np.array(y_train)
    print("Done with loading the dataset.")
    return x_train, y_train
def get_test_data():
    """
    Returns the desired test dataset.
    :return: numpy arrays (data, labels)
    """
    return get_data('../data/reduced_mnist_test.csv', '../data/mnist_test.csv')

def get_train_data():
    """
    Returns the desired training dataset.
    :return: numpy arrays (data, labels)
    """
    return get_data('../data/reduced_mnist_train.csv', '../data/mnist_train.csv')
def display_data(x_train, y_train):
    """
    Displays the desired digit.
    :param x_train: the data (784D)
    :param y_train: the associated label
    :return:
    """
    # Example of how to display: conversion of our one-dimensional vector into 2 dimensions
    matrix = np.reshape(x_train, (28, 28))
    plt.imshow(matrix, cmap='gray')
    plt.title("Voici un " + str(y_train))
    plt.show()
def generate_mean_images(x_train, y_train):
    """
    Returns the array of mean images for each class.
    :param x_train:
    :param y_train:
    :return:
    """
    counts = np.zeros(10).astype(int)
    for label in y_train:
        counts[label] += 1

    sum_pixel_values = np.zeros((10, 784)).astype(int)
    for img in range(len(y_train)):
        for pixel in range(len(x_train[0])):
            sum_pixel_values[y_train[img]][pixel] += x_train[img][pixel]

    pixel_probability = np.zeros((len(counts), len(x_train[0])))  # (10, 784)
    for classe in range(len(counts)):
        for pixel in range(len(x_train[0])):
            pixel_probability[classe][pixel] = np.divide(sum_pixel_values[classe][pixel] + 1, counts[classe] + 2)

    mean_images = []
    if verbose:
        plt.figure(figsize=(20, 4))  # size of the plot: (x, y) in INCHES
        plt.suptitle("Such wow, much impress !")
    for classe in range(len(counts)):
        class_mean = np.reshape(pixel_probability[classe], (28, 28))
        mean_images.append(class_mean)
        # Aesthetics
        plt.subplot(1, 10, classe + 1)
        plt.title(str(classe))
        plt.imshow(class_mean, cmap='gray')
        plt.xticks([])
        plt.yticks([])
    plt.show()
    return mean_images
#########
# used for "k_mean" (for now)
def count_labels(name, data):
    """
    Counts the number of data points associated with each label.
    :param name: name of what is being counted
    :param data: must be 1D
    :return: counts = the number for each label
    """
    header = "-- " + str(name) + " -- "  # making sure it's a String
    counts = [0]*10  # initializing the counting array
    for label in data:
        counts[label] += 1
    if verbose: print(header, "Amounts for each label:", counts)
    return counts
def get_list_of_indices(data, label):
    """
    Returns a list of indices corresponding to every position
    in 'data' where the specified 'label' can be found.
    :param data:
    :param label: the number associated with a region generated by k_mean
    :return:
    """
    return (np.where(data == label))[0].tolist()
def rearrange_labels(switches_to_make, to_change):
    """
    Takes region numbers and assigns the most probable digit (label) to each.
    For example, if switches_to_make[3] = 5, it means that the 4th region (index 3 of the list)
    should be considered as representing the digit "5".
    :param switches_to_make: list of changes to make
    :param to_change: this array will be changed according to 'switches_to_make'
    :return: nothing, the change is made in-situ
    """
    for region in range(len(to_change)):
        for label in range(len(switches_to_make)):
            if to_change[region] == label:  # if it corresponds to the "wrong" label given by scikit
                to_change[region] = switches_to_make[label]  # assign the "most probable" label
                break

def count_error_rate(found, truth):
    wrong = 0
    for i in range(len(found)):
        if found[i] != truth[i]:
            wrong += 1
    percent = wrong / len(found) * 100
    print("Error rate = ", percent, "%")
    return percent

def treat_data(switches_to_make, predictions, truth):
    rearrange_labels(switches_to_make, predictions)  # Rearranging the training labels
    count_error_rate(predictions, truth)  # Counting error rate
# TODO: reassign in case of doubles
# adapted from https://stackoverflow.com/a/45275056/9768291
def find_closest_digit_to_centroids(kmean, data, labels):
    """
    The array 'closest' will contain the index of the point in 'data' that is closest to each centroid.
    Let's say 'closest' gave the output array([0, 8, 5]) for three clusters. Then data[0] is the
    closest point in 'data' to centroid 0, data[8] is the closest to centroid 1, and so on.
    If the returned list is [9, 4, 2, 1, 3], it would mean that region #0 (index 0) best represents the digit 9.
    :param kmean: the variable where the 'fit' data has been stored
    :param data: the actual data that was used with 'fit' (x_train)
    :param labels: the true labels associated with 'data' (y_train)
    :return: list where each region is at its index and the value at that index is the digit it represents
    """
    closest, _ = pairwise_distances_argmin_min(kmean.cluster_centers_,
                                               data,
                                               metric="euclidean")
    switches_to_make = []
    for region in range(len(closest)):
        truth = labels[closest[region]]
        print("The assigned region", region, "should be considered as representing the digit", truth)
        switches_to_make.append(truth)
    print("Digits associated to each region (switches_to_make):", switches_to_make)
    return switches_to_make
Essentially, this is the function that solves my problem:
# adapted from https://stackoverflow.com/a/45275056/9768291
def find_closest_digit_to_centroids(kmean, data, labels):
    """
    The array 'closest' will contain the index of the point in 'data' that is closest to each centroid.
    Let's say 'closest' gave the output array([0, 8, 5]) for three clusters. Then data[0] is the
    closest point in 'data' to centroid 0, data[8] is the closest to centroid 1, and so on.
    If the returned list is [9, 4, 2, 1, 3], it would mean that region #0 (index 0) best represents the digit 9.
    :param kmean: the variable where the 'fit' data has been stored
    :param data: the actual data that was used with 'fit' (x_train)
    :param labels: the true labels associated with 'data' (y_train)
    :return: list where each region is at its index and the value at that index is the digit it represents
    """
    closest, _ = pairwise_distances_argmin_min(kmean.cluster_centers_,
                                               data,
                                               metric="euclidean")
    switches_to_make = []
    for region in range(len(closest)):
        truth = labels[closest[region]]
        print("The assigned region", region, "should be considered as representing the digit", truth)
        switches_to_make.append(truth)
    print("Digits associated to each region (switches_to_make):", switches_to_make)
    return switches_to_make