How do I implement k-means on a multivariate dataset?

Time: 2019-06-10 19:10:34

Tags: python cluster-analysis data-analysis

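I have this dataset: https://archive.ics.uci.edu/ml/datasets/Auto+MPG. I have already filled in the missing values and normalized the data. How do I apply k-means to it? Everything I have found so far only covers the two-variable case.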

1 answer:

Answer 0: (score: 0)

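You can use scikit-learn for k-means clustering. It works on any number of variables: each row of the input array is a sample and each column is a feature. The code below shows how to do it with 25 features.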

from sklearn.cluster import KMeans

# ---------- DATA ----------------
import numpy as np
np.random.seed(0)

# generated training data 
data = np.random.randint(1, 1000, size=(500, 25)) # data has 500 samples with 25 dim each

# testing data
test_data = np.random.randint(1, 1000, size=(10, 25)) # test_data has 10 samples with 25 dim each
# --------------------------------

# fit k-means clustering from scikit-learn on the training data
kmeans = KMeans(n_clusters=16, random_state=0).fit(data)  # creating 16 clusters with the data

# labels for your clusters
kmean_labels = kmeans.labels_

# Predict the closest cluster for each sample
predicted_labels = kmeans.predict(test_data)

For more details, see this link
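As a rough sketch of how the same approach might look on a table shaped like Auto MPG, the example below clusters several numeric columns of a pandas DataFrame. The column names follow the dataset description, but the values are randomly generated stand-ins, and the choice of 3 clusters is an arbitrary assumption, not a recommendation. The `StandardScaler` step can be skipped if the data is already normalized, as in the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Auto MPG table; the values are made up for illustration.
rng = np.random.default_rng(0)
n_rows = 200
df = pd.DataFrame({
    "mpg": rng.uniform(9, 47, n_rows),
    "cylinders": rng.choice([4, 6, 8], n_rows),
    "displacement": rng.uniform(68, 455, n_rows),
    "horsepower": rng.uniform(46, 230, n_rows),
    "weight": rng.uniform(1600, 5200, n_rows),
    "acceleration": rng.uniform(8, 25, n_rows),
})

# Every numeric column becomes one dimension of the clustering space.
features = ["mpg", "cylinders", "displacement",
            "horsepower", "weight", "acceleration"]

# Put all features on a comparable scale so no single column dominates the distance.
X = StandardScaler().fit_transform(df[features])

# n_clusters=3 is an assumed value chosen only for this example.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Attach the cluster index of each row back to the table.
df["cluster"] = kmeans.labels_

# One centroid per cluster, with one coordinate per feature.
print(kmeans.cluster_centers_.shape)
print(df["cluster"].value_counts())
```

With real data, the DataFrame would come from the cleaned dataset instead of the random generator; the rest of the code stays the same regardless of how many feature columns are selected.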