I have a df:
      id  Type1   Type2   Type3
0  10000    0.0    0.00    0.00
1  10001    0.0   63.72    0.00
2  10002  473.6  174.00   31.60
3  10003    0.0  996.00  160.92
4  10004    0.0  524.91    0.00
I applied k-means to this df and added the resulting clusters to the df:
kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(df.drop('id', axis=1))
df['cluster'] = kmeans.labels_
Now I am trying to add columns to the df that hold the Euclidean distance between each point (i.e. each row in the df) and each centroid:
def distance_to_centroid(row, centroid):
    row = row[['Type1',
               'Type2',
               'Type3']]
    return euclidean(row, centroid)
df['distance_to_center_0'] = df.apply(lambda r: distance_to_centroid(r, kmeans.cluster_centers_[0]),1)
This results in the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-56fa3ae3df54> in <module>()
----> 1 df['distance_to_center_0'] = df.apply(lambda r: distance_to_centroid(r, kmeans.cluster_centers_[0]),1)
~\_installed\anaconda\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6002 args=args,
6003 kwds=kwds)
-> 6004 return op.get_result()
6005
6006 def applymap(self, func):
~\_installed\anaconda\lib\site-packages\pandas\core\apply.py in get_result(self)
140 return self.apply_raw()
141
--> 142 return self.apply_standard()
143
144 def apply_empty_result(self):
~\_installed\anaconda\lib\site-packages\pandas\core\apply.py in apply_standard(self)
246
247 # compute the result using the series generator
--> 248 self.apply_series_generator()
249
250 # wrap results
~\_installed\anaconda\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
275 try:
276 for i, v in enumerate(series_gen):
--> 277 results[i] = self.f(v)
278 keys.append(v.name)
279 except Exception as e:
<ipython-input-34-56fa3ae3df54> in <lambda>(r)
----> 1 df['distance_to_center_0'] = df.apply(lambda r: distance_to_centroid(r, kmeans.cluster_centers_[0]),1)
<ipython-input-33-7b988ca2ad8c> in distance_to_centroid(row, centroid)
7 'atype',
8 'anothertype']]
----> 9 return euclidean(row, centroid)
~\_installed\anaconda\lib\site-packages\scipy\spatial\distance.py in euclidean(u, v, w)
596
597 """
--> 598 return minkowski(u, v, p=2, w=w)
599
600
~\_installed\anaconda\lib\site-packages\scipy\spatial\distance.py in minkowski(u, v, p, w)
488 if p < 1:
489 raise ValueError("p must be at least 1")
--> 490 u_v = u - v
491 if w is not None:
492 w = _validate_weights(w)
ValueError: ('operands could not be broadcast together with shapes (7,) (8,) ', 'occurred at index 0')
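The error appears to occur because id is not included in the row variable inside distance_to_centroid. To work around this, I could split the df into two parts (id in df1 and the remaining columns in df2). However, that is very manual and makes it hard to change the columns later. Is there a way to get the distance to each centroid into the original df without splitting it? Likewise, is there a better way to compute the Euclidean distance that does not require manually listing the columns in the row variable, or manually creating as many columns as there are clusters?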
Expected result: the original df with one additional distance column per centroid.
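The shape mismatch at the bottom of the traceback can be reproduced in isolation. The following is a minimal sketch, assuming a 7-element row and an 8-element centroid as reported in the error message; the arrays are hypothetical stand-ins, not the asker's data.

```python
import numpy as np

# Hypothetical stand-ins matching the shapes (7,) and (8,) in the error message.
row = np.zeros(7)
centroid = np.zeros(8)

try:
    # The traceback shows minkowski computing u - v, which is this subtraction.
    diff = row - centroid
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Whenever the row passed to euclidean has a different number of entries than the centroid, the subtraction fails in this way, so the two must be built from exactly the same set of columns.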
Answer 0 (score: 4)
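We need to pass the coordinate part of df to KMeans, and we want to compute the distances to the centroids using only the coordinate part of df. So we might as well define a variable for that quantity: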
points = df.drop('id', axis=1)
# or points = df[['Type1', 'Type2', 'Type3']]
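Note that once a cluster column has been added to df, df.drop('id', axis=1) would include it as well. One way to avoid listing the coordinate columns by hand is to exclude the bookkeeping columns instead. This is only a sketch: the names non_feature_cols and feature_cols are hypothetical, and it assumes every other column is a coordinate.

```python
import pandas as pd

df = pd.DataFrame({'id': [10000, 10001, 10002],
                   'Type1': [0.0, 0.0, 473.6],
                   'Type2': [0.0, 63.72, 174.0],
                   'Type3': [0.0, 0.0, 31.6],
                   'cluster': [0, 1, 2]})  # hypothetical labels column

# Assumption: everything except these bookkeeping columns is a coordinate.
non_feature_cols = ['id', 'cluster']
feature_cols = [c for c in df.columns if c not in non_feature_cols]

points = df[feature_cols]
print(feature_cols)
```

With this approach, adding or removing a Type column in df changes points automatically, while id and cluster never leak into the distance computation.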
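Then we can compute the distance from each row's coordinates to its corresponding centroid:
import numpy as np
centroids = kmeans.cluster_centers_
# axis=1 gives one Euclidean norm per row; without it, norm collapses the
# whole matrix into a single scalar
dist = np.linalg.norm(points.to_numpy() - centroids[df['cluster'].to_numpy()], axis=1)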
Note that centroids[df['cluster']] returns a NumPy array with the same shape as points. Indexing by df['cluster'] "expands" the centroids array so that it has one row per point.
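The "expanding" effect of indexing with the label array is easy to see on a tiny hand-made case. This is an illustrative sketch with made-up centroids and labels, not output from the fitted model:

```python
import numpy as np

centroids = np.array([[0.0, 0.0],
                      [10.0, 10.0],
                      [20.0, 20.0]])   # 3 hypothetical centroids, 2 features
labels = np.array([2, 0, 0, 1])        # hypothetical cluster label per point
points = np.array([[23.0, 24.0],
                   [0.0, 0.0],
                   [3.0, 4.0],
                   [10.0, 10.0]])

# Integer-array indexing picks one centroid row per label.
per_point_centroids = centroids[labels]
print(per_point_centroids.shape)   # same shape as points

# Row-wise Euclidean distance from each point to its own centroid.
dist = np.linalg.norm(points - per_point_centroids, axis=1)
print(dist)
```

Because per_point_centroids lines up row by row with points, the subtraction is elementwise and axis=1 reduces each row to a single distance.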
We can then assign these dist values to a DataFrame column using
df['dist'] = dist
For example,
import numpy as np
import pandas as pd
import sklearn.cluster as cluster
df = pd.DataFrame({'Type1': [0.0, 0.0, 473.6, 0.0, 0.0],
                   'Type2': [0.0, 63.72, 174.0, 996.0, 524.91],
                   'Type3': [0.0, 0.0, 31.6, 160.92, 0.0],
                   'id': [1000, 10001, 10002, 10003, 10004]})
points = df.drop('id', axis=1)
# or points = df[['Type1', 'Type2', 'Type3']]
kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(points)
df['cluster'] = kmeans.labels_
centroids = kmeans.cluster_centers_
# axis=1 gives one Euclidean norm per row (one distance per point)
dist = np.linalg.norm(points.to_numpy() - centroids[df['cluster'].to_numpy()], axis=1)
df['dist'] = dist
print(df)
which yields
   Type1   Type2   Type3     id  cluster          dist
0    0.0    0.00    0.00   1000        4  0.000000e+00
1    0.0   63.72    0.00  10001        2  2.842171e-14
2  473.6  174.00   31.60  10002        1  0.000000e+00
3    0.0  996.00  160.92  10003        3  0.000000e+00
4    0.0  524.91    0.00  10004        0  0.000000e+00
Because n_clusters equals the number of points here, each point is its own centroid, so every distance is zero up to floating-point rounding.
If you want the distance from each point to every cluster centroid, you can use sdist.cdist:
import scipy.spatial.distance as sdist
sdist.cdist(points, centroids)
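To see what cdist returns, here is a small sketch with hand-checkable numbers (the points and centroids are made up for illustration). Entry [i, j] of the result is the Euclidean distance from point i to centroid j:

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0.0, 0.0],
                   [3.0, 4.0]])
centroids = np.array([[0.0, 0.0],
                      [6.0, 8.0]])

# Result has shape (number of points, number of centroids).
dists = cdist(points, centroids)
print(dists)
# [[ 0. 10.]
#  [ 5.  5.]]
```

The default metric of cdist is Euclidean, so no extra argument is needed for this use case.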
For example,
import numpy as np
import pandas as pd
import sklearn.cluster as cluster
import scipy.spatial.distance as sdist
df = pd.DataFrame({'Type1': [0.0, 0.0, 473.6, 0.0, 0.0],
                   'Type2': [0.0, 63.72, 174.0, 996.0, 524.91],
                   'Type3': [0.0, 0.0, 31.6, 160.92, 0.0],
                   'id': [1000, 10001, 10002, 10003, 10004]})
points = df.drop('id', axis=1)
# or points = df[['Type1', 'Type2', 'Type3']]
kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(points)
df['cluster'] = kmeans.labels_
centroids = kmeans.cluster_centers_
dists = pd.DataFrame(
    sdist.cdist(points, centroids),
    columns=['dist_{}'.format(i) for i in range(len(centroids))],
    index=df.index)
df = pd.concat([df, dists], axis=1)
print(df)
which yields
   Type1   Type2   Type3     id  cluster      dist_0      dist_1        dist_2       dist_3       dist_4
0    0.0    0.00    0.00   1000        4  524.910000  505.540819  6.372000e+01  1008.915877     0.000000
1    0.0   63.72    0.00  10001        2  461.190000  487.295802  2.842171e-14   946.066195    63.720000
2  473.6  174.00   31.60  10002        1  590.282431    0.000000  4.872958e+02   957.446929   505.540819
3    0.0  996.00  160.92  10003        3  497.816266  957.446929  9.460662e+02     0.000000  1008.915877
4    0.0  524.91    0.00  10004        0    0.000000  590.282431  4.611900e+02   497.816266   524.910000
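As a possible alternative, scikit-learn's KMeans also has a transform method, which maps the input into cluster-distance space (one column per cluster center). The sketch below compares it against cdist on the same data. The choices n_clusters=2 and n_init=10 are illustrative assumptions, not taken from the answer above, and the loose tolerance allows for rounding differences between the two implementations.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

points = pd.DataFrame({'Type1': [0.0, 0.0, 473.6, 0.0, 0.0],
                       'Type2': [0.0, 63.72, 174.0, 996.0, 524.91],
                       'Type3': [0.0, 0.0, 31.6, 160.92, 0.0]})

# Illustrative parameters (assumptions): 2 clusters, 10 initialisations.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Distances from every point to every cluster center, computed two ways.
via_transform = kmeans.transform(points)
via_cdist = cdist(points, kmeans.cluster_centers_)

same = np.allclose(via_transform, via_cdist, atol=1e-3)
print(via_transform.shape, same)
```

If the two results agree on your data, transform saves the extra SciPy import; cdist remains useful when you want a metric other than Euclidean.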