The documentation is vague about this, and I thought it would be simple to implement.
The k_mean algorithm applied to the MNIST digit dataset outputs 10 regions, each identified by an arbitrary number, and that number is not the digit represented by the majority of the samples contained in the region.
I do have my table of ground_truth labels.
How can I make each region produced by the k_mean algorithm end up labelled with the digit it most probably covers?
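To make the goal concrete, here is a minimal sketch of the kind of majority-vote mapping I am after (the helper name majority_vote_mapping is purely illustrative; cluster_labels would come from kmeans.labels_ and y_train would hold the true digits, both as NumPy arrays):

import numpy as np

def majority_vote_mapping(cluster_labels, y_train, n_clusters=10):
    mapping = {}
    for region in range(n_clusters):
        # true digits of all the samples that fell into this region
        digits_in_region = y_train[cluster_labels == region]
        # the most frequent true digit becomes this region's label
        mapping[region] = np.bincount(digits_in_region, minlength=10).argmax()
    return mapping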
I spent a few hours yesterday writing this code, but it is still incomplete:
# TODO: for centroid-average method, see https://stackoverflow.com/a/25831425/9768291
def most_probable_digit(indices, data):
    """
    Given an array of indices (for a specific label assigned by scikit, obtained with 'get_list_of_indices')
    telling us where the true labels sit in 'data', this function counts how many times each true label
    appears and returns the one that appears most often (and which therefore has the highest probability
    of being the ground_truth_label designated by the region delimited by scikit).
    :param indices: array of indices in 'data' that belong to one k_mean region
    :param data: all the data spread across the k_mean regions
    :return: the most probable value (digit) associated with this region
    """
    actual_labels = []
    for i in indices:
        actual_labels.append(data[i])
    if verbose: print("The actual labels for each of those digits are:", actual_labels)
    counts = count_labels("actual labels", actual_labels)  # 'count_labels' and 'verbose' are defined in my utils module
    probable = counts.index(max(counts))
    if verbose: print("Most probable digit:", probable)
    return probable

def get_list_of_indices(data, label):
    """
    Returns a list of indices corresponding to every position
    in 'data' where the specified 'label' can be found.
    :param data:
    :param label: the number associated with a region generated by k_mean
    :return:
    """
    return (np.where(data == label))[0].tolist()

# TODO: reassign in case of doubles
def obtain_corresponding_labels(data, real_labels):
    """
    Assign the most probable label to each region.
    :param data: list of regions associated with x_train or x_test (the order is preserved!)
    :param real_labels: actual labels to assign to the region numbers
    :return: the list of corresponding actual labels to region numbers
    """
    switches_to_make = []
    for i in range(10):
        list_of_indices = get_list_of_indices(data, i)  # indices in 'data' which are associated with region "i"
        probable_label = most_probable_digit(list_of_indices, real_labels)
        print("The assigned region", i, "should be considered as representing the digit", probable_label)
        switches_to_make.append(probable_label)
    return switches_to_make

def rearrange_labels(switches_to_make, to_change):
    """
    Takes region numbers and assigns the most probable digit (label) to each.
    For example, if switches_to_make[3] = 5, it means that the 4th region (index 3 of the list)
    should be considered as representing the digit "5".
    :param switches_to_make: list of changes to make
    :param to_change: this array will be changed according to 'switches_to_make'
    :return: nothing, the change is made in place
    """
    for region in range(len(to_change)):
        for label in range(len(switches_to_make)):
            if to_change[region] == label:  # if it corresponds to the "wrong" label given by scikit
                to_change[region] = switches_to_make[label]  # assign the "most probable" label
                break

def count_error_rate(found, truth):
    wrong = 0
    for i in range(len(found)):
        if found[i] != truth[i]:
            wrong += 1
    print("Error rate = ", wrong / len(found) * 100, "%\n\n")

def treat_data(switches_to_make, predictions, truth):
    rearrange_labels(switches_to_make, predictions)  # Rearranging the training labels
    count_error_rate(predictions, truth)  # Counting error rate
Currently, the problem with my code is that it can produce duplicates: if two regions share the same most-probable digit, that digit ends up associated with both regions.
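One idea I have been considering to avoid these duplicates (not implemented in my code above) is to force a one-to-one region-to-digit assignment with the Hungarian algorithm, via scipy.optimize.linear_sum_assignment. A rough sketch, where the helper name unique_label_mapping is purely illustrative:

import numpy as np
from scipy.optimize import linear_sum_assignment

def unique_label_mapping(cluster_labels, y_train, n_clusters=10):
    # counts[i][j] = how many samples of true digit j ended up in region i
    counts = np.zeros((n_clusters, 10), dtype=int)
    for region, digit in zip(cluster_labels, y_train):
        counts[region][digit] += 1
    # pick the region-to-digit pairing that maximizes total agreement,
    # so no digit can be assigned to two regions
    regions, digits = linear_sum_assignment(counts, maximize=True)
    return dict(zip(regions, digits))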
Here is how I use the code:
kmeans = KMeans(n_clusters=10) # TODO: eventually use "init=ndarray" to be able to use custom centroids for init ?
kmeans.fit(x_train)
training_labels = kmeans.labels_
print("Done with calculating the k-mean.\n")
switches_to_make = utils.obtain_corresponding_labels(training_labels, y_train) # Obtaining the most probable labels
utils.treat_data(switches_to_make, training_labels, y_train)
print("Assigned labels: ", training_labels)
print("Real labels: ", y_train)
print("\n####################################################\nMoving on to predictions")
predictions = kmeans.predict(x_test)
utils.treat_data(switches_to_make, predictions, y_test)
My code gets an error rate of roughly 50%.
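To understand where that 50% comes from, one diagnostic I could run (hypothetical, not part of the code above) is to look at how the true digits are spread across the 10 regions; whenever two digits share a region, the minority one is guaranteed to be misclassified by the majority-vote mapping. For example:

import numpy as np

# Recompute the region of each training sample (treat_data modifies training_labels in place)
regions = kmeans.predict(x_train)
for region in range(10):
    digits_in_region = y_train[regions == region]
    print("Region", region, "digit counts:", np.bincount(digits_in_region, minlength=10))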
Answer 0 (score: 0)
If I understand you correctly, you want to assign the actual digit value as the label of the cluster it matches, right? If so, I don't think that is possible.
K-Means is an unsupervised learning algorithm. It has no understanding of what it is looking at, and the labels it assigns are arbitrary: instead of 0, 1, 2, ... it could just as well have called them 'apple', 'orange', 'grape', .... All K-Means can tell you is that, according to some metric, a bunch of data points are similar to one another, and that's it. It is great for data exploration or pattern finding, but not for telling you what something actually is.
No matter what post-processing you apply, the computer will never programmatically know what the true label is unless you, a human, tell it. In that case you might as well use a supervised learning algorithm.
If you want to train a model that assigns the correct label when it sees a digit, you have to use a supervised learning method (where labels are a thing). Take a look at Random Forest, for example. Here is a similar attempt.
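As a rough illustration of that supervised route, a minimal sketch using scikit-learn's RandomForestClassifier (reusing the x_train, y_train, x_test and y_test arrays from your question) could look like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train a supervised classifier on the labelled training digits
clf = RandomForestClassifier(n_estimators=100)
clf.fit(x_train, y_train)

# Predict the actual digit labels for the test set and compare against the ground truth
test_predictions = clf.predict(x_test)
print("Accuracy:", accuracy_score(y_test, test_predictions))

Because the true labels are used during fitting, the model's outputs are actual digits rather than arbitrary cluster numbers.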
Answer 1 (score: 0)
Here is the code that uses my solution:
from sklearn.cluster import KMeans
import utils
# Loading the dataset
x_train, y_train = utils.get_train_data()
x_test, y_test = utils.get_test_data()
kmeans = KMeans(n_clusters=10)
kmeans.fit(x_train)
training_labels = kmeans.labels_
switches_to_make = utils.find_closest_digit_to_centroids(kmeans, x_train, y_train) # Obtaining the most probable labels (digits) for each region
utils.treat_data(switches_to_make, training_labels, y_train)
predictions = kmeans.predict(x_test)
utils.treat_data(switches_to_make, predictions, y_test)
And here is utils.py:
import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances_argmin_min
use_reduced = True # Flag variable to use the reduced datasets (generated by 'pre_process.py')
verbose = False # Should debugging prints be shown
def get_data(reduced_path, path):
    """
    Loads the desired dataset.
    :param reduced_path: path to the reduced version (generated by 'pre_process.py')
    :param path: path to the full version
    :return: numpy arrays (data, labels)
    """
    if use_reduced:
        data = open(reduced_path)
    else:
        data = open(path)
    csv_file = csv.reader(data)
    data_points = []
    for row in csv_file:
        data_points.append(row)
    data_points.pop(0)  # remove the first line, i.e. the column headers
    data.close()

    # Convert from String to int
    for i in range(len(data_points)):  # for each image
        for j in range(len(data_points[0])):  # for each pixel
            data_points[i][j] = int(data_points[i][j])
            # # To obtain FLOAT values normalized between 0 and 1:
            # data_points[i][j] = np.divide(float(data_points[i][j]), 255)

    # Separate the labels from the data
    y_train = []  # labels
    for row in data_points:
        y_train.append(row[0])  # first column is the label
    x_train = []  # data
    for row in data_points:
        x_train.append(row[1:785])  # other columns are the pixels

    x_train = np.array(x_train)
    y_train = np.array(y_train)
    print("Done with loading the dataset.")
    return x_train, y_train
def get_test_data():
    """
    Returns the desired test dataset.
    :return: numpy arrays (data, labels)
    """
    return get_data('../data/reduced_mnist_test.csv', '../data/mnist_test.csv')

def get_train_data():
    """
    Returns the desired training dataset.
    :return: numpy arrays (data, labels)
    """
    return get_data('../data/reduced_mnist_train.csv', '../data/mnist_train.csv')
def display_data(x_train, y_train):
    """
    Displays the desired digit.
    :param x_train: the data (784D)
    :param y_train: the associated label
    :return:
    """
    # Example of how to display: conversion of our one-dimensional vector into 2 dimensions
    matrix = np.reshape(x_train, (28, 28))
    plt.imshow(matrix, cmap='gray')
    plt.title("Voici un " + str(y_train))
    plt.show()
def generate_mean_images(x_train, y_train):
    """
    Returns the array of mean images for each class.
    :param x_train:
    :param y_train:
    :return:
    """
    counts = np.zeros(10).astype(int)
    for label in y_train:
        counts[label] += 1

    sum_pixel_values = np.zeros((10, 784)).astype(int)
    for img in range(len(y_train)):
        for pixel in range(len(x_train[0])):
            sum_pixel_values[y_train[img]][pixel] += x_train[img][pixel]

    pixel_probability = np.zeros((len(counts), len(x_train[0])))  # (10, 784)
    for classe in range(len(counts)):
        for pixel in range(len(x_train[0])):
            pixel_probability[classe][pixel] = np.divide(sum_pixel_values[classe][pixel] + 1, counts[classe] + 2)

    mean_images = []
    if verbose:
        plt.figure(figsize=(20, 4))  # size of the plot: (x, y) in INCHES
        plt.suptitle("Such wow, much impress !")
    for classe in range(len(counts)):
        class_mean = np.reshape(pixel_probability[classe], (28, 28))
        mean_images.append(class_mean)
        # Aesthetics
        plt.subplot(1, 10, classe + 1)
        plt.title(str(classe))
        plt.imshow(class_mean, cmap='gray')
        plt.xticks([])
        plt.yticks([])
    plt.show()
    return mean_images
#########
# used for "k_mean" (for now)
def count_labels(name, data):
    """
    Counts the number of data points associated with each label.
    :param name: name of what is being counted
    :param data: must be 1D
    :return: counts = the number for each label
    """
    header = "-- " + str(name) + " -- "  # making sure it's a String
    counts = [0]*10  # initializing the counting array
    for label in data:
        counts[label] += 1
    if verbose: print(header, "Amounts for each label:", counts)
    return counts
def get_list_of_indices(data, label):
    """
    Returns a list of indices corresponding to every position
    in 'data' where the specified 'label' can be found.
    :param data:
    :param label: the number associated with a region generated by k_mean
    :return:
    """
    return (np.where(data == label))[0].tolist()
def rearrange_labels(switches_to_make, to_change):
    """
    Takes region numbers and assigns the most probable digit (label) to each.
    For example, if switches_to_make[3] = 5, it means that the 4th region (index 3 of the list)
    should be considered as representing the digit "5".
    :param switches_to_make: list of changes to make
    :param to_change: this array will be changed according to 'switches_to_make'
    :return: nothing, the change is made in-situ
    """
    for region in range(len(to_change)):
        for label in range(len(switches_to_make)):
            if to_change[region] == label:  # if it corresponds to the "wrong" label given by scikit
                to_change[region] = switches_to_make[label]  # assign the "most probable" label
                break

def count_error_rate(found, truth):
    wrong = 0
    for i in range(len(found)):
        if found[i] != truth[i]:
            wrong += 1
    percent = wrong / len(found) * 100
    print("Error rate = ", percent, "%")
    return percent

def treat_data(switches_to_make, predictions, truth):
    rearrange_labels(switches_to_make, predictions)  # Rearranging the training labels
    count_error_rate(predictions, truth)  # Counting error rate
# TODO: reassign in case of doubles
# adapted from https://stackoverflow.com/a/45275056/9768291
def find_closest_digit_to_centroids(kmean, data, labels):
    """
    The array 'closest' will contain the index of the point in 'data' that is closest to each centroid.
    Let's say 'closest' gave the output array([0, 8, 5]) for three clusters. Then data[0] is the
    closest point in 'data' to centroid 0, data[8] is the closest to centroid 1, and so on.
    If the returned list is [9, 4, 2, 1, 3], it would mean that region #0 (index 0) best represents the digit 9.
    :param kmean: the variable where the 'fit' data has been stored
    :param data: the actual data that was used with 'fit' (x_train)
    :param labels: the true labels associated with 'data' (y_train)
    :return: list where each region is at its index and the value at that index is the digit it represents
    """
    closest, _ = pairwise_distances_argmin_min(kmean.cluster_centers_,
                                               data,
                                               metric="euclidean")
    switches_to_make = []
    for region in range(len(closest)):
        truth = labels[closest[region]]
        print("The assigned region", region, "should be considered as representing the digit", truth)
        switches_to_make.append(truth)
    print("Digits associated to each region (switches_to_make):", switches_to_make)
    return switches_to_make
Essentially, this is the function that solves my problem:
# adapted from https://stackoverflow.com/a/45275056/9768291
def find_closest_digit_to_centroids(kmean, data, labels):
    """
    The array 'closest' will contain the index of the point in 'data' that is closest to each centroid.
    Let's say 'closest' gave the output array([0, 8, 5]) for three clusters. Then data[0] is the
    closest point in 'data' to centroid 0, data[8] is the closest to centroid 1, and so on.
    If the returned list is [9, 4, 2, 1, 3], it would mean that region #0 (index 0) best represents the digit 9.
    :param kmean: the variable where the 'fit' data has been stored
    :param data: the actual data that was used with 'fit' (x_train)
    :param labels: the true labels associated with 'data' (y_train)
    :return: list where each region is at its index and the value at that index is the digit it represents
    """
    closest, _ = pairwise_distances_argmin_min(kmean.cluster_centers_,
                                               data,
                                               metric="euclidean")
    switches_to_make = []
    for region in range(len(closest)):
        truth = labels[closest[region]]
        print("The assigned region", region, "should be considered as representing the digit", truth)
        switches_to_make.append(truth)
    print("Digits associated to each region (switches_to_make):", switches_to_make)
    return switches_to_make