Question

我有以下数据：

import pandas as pd
import random
import matplotlib.pyplot as plt

df = pd.DataFrame()
df['x'] = [3, 2, 4, 3, 4, 6, 8, 7, 8, 9]
df['y'] = [3, 2, 3, 4, 5, 6, 5, 4, 4, 3]
df['val'] = [1, 10, 1, 1, 1, 8, 1, 1, 1, 1]

k = 2
centroids = {i + 1: [np.random.randint(0, 10), np.random.randint(0, 10)] for i in range(k)}

plt.scatter(df['x'], df['y'], color='blue')
for i in centroids.keys():
    plt.scatter(*centroids[i], color='red', marker='^')
plt.show()

我想将初始质心放在具有最高值的数据点上。然后，在这种情况下，质心应位于坐标为（2,2）和（6,6）的数据点上。

   x  y  val
0  3  3    1
1  2  2   10
2  4  3    1
3  3  4    1
4  4  5    1
5  6  6    8
6  8  5    1
7  7  4    1
8  8  4    1
9  9  3    1

Answer 1

您可以按val列对数据框进行排序，以获取最高k值的索引，然后使用df.iloc对数据框进行切片。

按降序排序：

df = df.sort_values('val', ascending=False)
print(df)

   x  y  val
1  2  2   10
5  6  6    8
0  3  3    1
2  4  3    1
3  3  4    1
4  4  5    1
6  8  5    1
7  7  4    1
8  8  4    1
9  9  3    1

切片数据框：

k=2 # Number of centroids
highest_points_as_centroids = df.iloc[0:k,[0,1]]

print(highest_points_as_centroids )

   x  y
1  2  2
5  6  6

您可以通过highest_points_as_centroids.values

获取x，y值作为numpy数组

array([[2, 2],
       [6, 6]], dtype=int64)

EDIT1：

或者，更简洁（正如@sharatpc所建议的）

df.nlargest(2, 'val')[['x','y']].values
array([[2, 2],
   [6, 6]], dtype=int64)

EDIT2：

OP评论说他们希望质心在字典中：

centroids = highest_points_as_centroids.reset_index(drop=True).T.to_dict('list')
print(centroids)
{0: [2L, 2L], 1: [6L, 6L]}

如果从1开始严格需要字典键：

highest_points_as_centroids.reset_index(drop=True, inplace=True)
highest_points_as_centroids.index +=1
centroids = highest_points_as_centroids.T.to_dict('list')
print(centroids)
{1: [2L, 2L], 2: [6L, 6L]}

Answer 2

在一个地方回答@ arzamoona的其他问题：

import pandas as pd
import random
import matplotlib.pyplot as plt

df = pd.DataFrame()
df['x'] = [3, 2, 4, 3, 4, 6, 8, 7, 8, 9]
df['y'] = [3, 2, 3, 4, 5, 6, 5, 4, 4, 3]
df['val'] = [1, 10, 1, 1, 1, 8, 1, 1, 1, 1]

k = 2
centroids=df.nlargest(k, 'val')[['x','y']]

plt.scatter(df['x'], df['y'], color='blue')
plt.scatter(centroids.x, centroids.y, color='red', marker='^')
plt.show()

然后将质心值添加到字典中：

{i:v for i,v in enumerate(centroids.values.tolist())}
{0: [2, 2], 1: [6, 6]}

Python：如何将初始质心放在k-means的特定数据点上？

2 个答案: