How to use scikit-learn KMeans when I have a DataFrame

Time: 2016-01-23 02:35:43

Tags: python scikit-learn k-means

I have converted my dataset into a DataFrame. I would like to know how to use it with scikit-learn's KMeans, or with any other k-means package.

import pandas as pd
from sklearn.cluster import KMeans
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it.
from sklearn.model_selection import train_test_split

sample_df = pd.read_csv('sample.csv', sep='\t', keep_default_na=False, na_values=[""])
print(sample_df['Polarity'])
print(sample_df['Gravity'])
print(sample_df['Sense'])
print(sample_df[['Polarity', 'Gravity']])
# precompute_distances and n_jobs were removed from KMeans in scikit-learn 1.0, so they are omitted.
KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True)

I would also appreciate help with the train/test split. Thanks in advance.

1 Answer:

Answer 0: (score: 7)

sklearn is fully compatible with pandas DataFrames, so it is straightforward:

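from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation (removed in 0.20)

sample_df_train, sample_df_test = train_test_split(sample_df, train_size=0.6)

# precompute_distances and n_jobs were removed from KMeans in scikit-learn 1.0, so they are omitted.
cluster = KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True)
cluster.fit(sample_df_train)  # learn the centroids from the training rows
result = cluster.predict(sample_df_test)  # index of the nearest centroid for each test row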
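For anyone who wants to try this without the original sample.csv, here is a minimal self-contained sketch. The data is made up: the column names are borrowed from the question, but their values and types, n_clusters=3, and the fixed random_state are assumptions for illustration only. It also assumes a recent scikit-learn (sklearn.model_selection) and passes only numeric columns to KMeans, since the algorithm works on Euclidean distances.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for sample.csv: two numeric columns and one text column.
rng = np.random.RandomState(0)
sample_df = pd.DataFrame({
    "Polarity": rng.uniform(-1.0, 1.0, size=50),
    "Gravity": rng.uniform(0.0, 5.0, size=50),
    "Sense": ["a", "b"] * 25,
})

# Pass only the numeric columns; the text column is left out of the clustering.
features = sample_df[["Polarity", "Gravity"]]

train_df, test_df = train_test_split(features, train_size=0.6, random_state=0)

# n_clusters=3 is an arbitrary choice for this toy data.
cluster = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster.fit(train_df)

# predict returns one cluster index per row; keep the DataFrame index so the
# labels can be joined back onto the original rows later.
test_labels = pd.Series(cluster.predict(test_df), index=test_df.index, name="cluster")
print(test_labels.head())
```

Because test_labels shares its index with test_df, something like sample_df.join(test_labels) would attach the cluster assignment to the held-out rows.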

train_size=0.6 means that 60% of the data is used for training and the remaining 40% for testing.
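The 60/40 proportion can be checked on a small frame. This is only a sketch: the 10-row DataFrame and the random_state are invented for the example, and it assumes a scikit-learn version that provides sklearn.model_selection.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for sample_df.
df = pd.DataFrame({"Polarity": range(10), "Gravity": range(10, 20)})

# Rows are shuffled before splitting; random_state only makes the shuffle repeatable.
train_df, test_df = train_test_split(df, train_size=0.6, random_state=0)

print(len(train_df), len(test_df))  # 6 rows for training, 4 for testing
```

Both halves come back as DataFrames that keep their original index labels, so no row appears in both.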

More information:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html