如何在scikit-learn中获得有意义的k结果

时间:2015-03-13 15:48:17

标签: python machine-learning scikit-learn k-means

我有一个如下所示的数据集:

  

{' dns_query_count':' 11',' http_hostnames_count':' 7',' dest_port_count':& #39; 3',' ip_count':' 11',' signature_count':' 0',' src_ip&# 39;:' 10.0.64.42',' http_user_agent_count':' 2'}

这已经从csv转换为dict

然后我用DictVectorizer转换它

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
d = vec.fit_transform(data).toarray()

然后我尝试使用Kmeans

from sklearn.cluster import KMeans
k = KMeans(n_clusters=2).fit(d)

我的问题是如何获得有关哪一行数据属于哪个群集的信息?

我希望得到这样的东西:

  

{' dns_query_count':' 11',' http_hostnames_count':' 7',' dest_port_count':& #39; 3',' ip_count':' 11',' signature_count':' 0',' src_ip&# 39;:' 10.0.64.42',' http_user_agent_count':' 2',群集:' 1'}

有人可以给我一步一步示例如何从我展示的原始数据到相同的数据,以及他们所属的集群的信息吗?

例如,我在这个数据集中使用了Weka,它向我展示了我想要的东西 - 我可以点击图表上的数据点并准确读取哪些数据点属于哪个集群。如何用sklearn获得类似的结果?

1 个答案:

答案 0 :(得分:2)

这将显示如何检索每行和群集中心的群集ID。我还测量了从每一行到每个质心的距离,以便您可以看到行已正确分配给聚类。

In [1]:

import pandas as pd
from sklearn.cluster import KMeans
from numpy.random import random
from scipy.spatial.distance import euclidean

# I'm going to generate some random data so you can just copy this and see it work

random_data = []

for i in range(0,10):
    random_data.append({'dns_query_count': random(),
 'http_hostnames_count': random(),
 'dest_port_count': random(),
 'ip_count': random(),
 'signature_count': random(),
 'src_ip': random(),
 'http_user_agent_count': random()}
)

df = pd.DataFrame(random_data)

km = KMeans(n_clusters=2).fit(df)

df['cluster_id'] = km.labels_

# get the cluster centers and compute the distance from each point to the center
# this will show that all points are assigned to the correct cluster

def distance_to_centroid(row, centroid):
    row = row[['dns_query_count',
                'http_hostnames_count',
                'dest_port_count',
                'ip_count',
                'signature_count',
                'src_ip',
                'http_user_agent_count']]
    return euclidean(row, centroid)

# to get the cluster centers use km.cluster_centers_

df['distance_to_center0'] = df.apply(lambda r: distance_to_centroid(r,
    km.cluster_centers_[0]),1)

df['distance_to_center1'] = df.apply(lambda r: distance_to_centroid(r,
    km.cluster_centers_[1]),1)

df.head()

Out [1]:
   dest_port_count  dns_query_count  http_hostnames_count  \
0         0.516920         0.135925              0.090209   
1         0.528907         0.898578              0.752862   
2         0.426108         0.604251              0.524905   
3         0.373985         0.606492              0.503487   
4         0.319943         0.970707              0.707207   

   http_user_agent_count  ip_count  signature_count    src_ip  cluster_id  \
0               0.987878  0.808556         0.860859  0.642014           0   
1               0.417033  0.130365         0.067021  0.322509           1   
2               0.528679  0.216118         0.041491  0.522445           1   
3               0.780292  0.130404         0.048353  0.911599           1   
4               0.156117  0.719902         0.484865  0.752840           1   

   distance_to_center0  distance_to_center1  
0             0.846099             1.124509  
1             1.175765             0.760310  
2             0.970046             0.615725  
3             1.054555             0.946233  
4             0.640906             1.020849  

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit_predict