I have the following sioma_df dataframe.
These are the shape and column index of sioma_df. It has 13807 rows and 37 columns:
sioma_df.shape
(13807, 37)
sioma_df.columns
Index(['Luz (lux)', 'Precipitación (ml)', 'Temperatura (°C)',
'Velocidad del Viento (km/h)', 'E', 'N', 'NE', 'NO', 'O', 'S', 'SE',
'SO', 'PORVL2N1', 'PORVL2N2', 'PORVL4N1', 'PORVL5N1', 'PORVL6N1',
'PORVL7N1', 'PORVL8N1', 'PORVL9N1', 'PORVL10N1', 'PORVL13N1',
'PORVL14N1', 'PORVL15N1', 'PORVL16N1', 'PORVL16N2', 'PORVL18N1',
'PORVL18N2', 'PORVL18N3', 'PORVL18N4', 'PORVL21N1', 'PORVL21N2',
'PORVL21N3', 'PORVL21N4', 'PORVL21N5', 'PORVL24N1', 'PORVL24N2'],
dtype='object')
I want to apply the k-means algorithm, and I decided that in the random initialization stage I will have k=9 centroids:
import numpy as np

# Turn the dataframe into a numpy array
# (get_values() was removed in pandas 1.0; .values works across versions)
sioma_numpy = sioma_df.values
k = 9
# Create a dictionary with the centroid coordinates
centroids = {
    i + 1: [np.random.randint(0, np.max(sioma_numpy)), np.random.randint(0, np.max(sioma_numpy))]
    for i in range(k)
}
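As a point of comparison, one common way to initialize centroids is to sample k distinct rows from the data, so that each centroid has one coordinate per feature column. The following is only a sketch under that assumption; the `init_centroids` helper and the synthetic `data` array are hypothetical stand-ins and are not part of the code above.

```python
import numpy as np

def init_centroids(data, k, seed=0):
    """Pick k distinct rows of `data` as initial centroids (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Sample k row indices without replacement
    idx = rng.choice(data.shape[0], size=k, replace=False)
    return data[idx]

# Synthetic stand-in for sioma_df.values: 100 points, 37 features
data = np.random.default_rng(1).random((100, 37))
centroids_arr = init_centroids(data, k=9)
print(centroids_arr.shape)  # (9, 37)
```

Each row of `centroids_arr` is then a point in the same 37-dimensional space as the data.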
I plot the data before applying the clustering:
# Get each column individually as an array
c1 = sioma_df['Luz (lux)'].values
c2 = sioma_df['Precipitación (ml)'].values
c3 = sioma_df['Temperatura (°C)'].values
c4 = sioma_df['Velocidad del Viento (km/h)'].values
c5 = sioma_df['PORVL2N1'].values
c6 = sioma_df['PORVL2N2'].values
c7 = sioma_df['PORVL4N1'].values
c8 = sioma_df['PORVL5N1'].values
c9 = sioma_df['PORVL6N1'].values
c10 = sioma_df['PORVL7N1'].values
c11 = sioma_df['PORVL8N1'].values
c12 = sioma_df['PORVL9N1'].values
c13 = sioma_df['PORVL10N1'].values
c14 = sioma_df['PORVL13N1'].values
c15 = sioma_df['PORVL14N1'].values
c16 = sioma_df['PORVL15N1'].values
c17 = sioma_df['PORVL16N1'].values
c18 = sioma_df['PORVL16N2'].values
c19 = sioma_df['PORVL18N1'].values
c20 = sioma_df['PORVL18N2'].values
c21 = sioma_df['PORVL18N3'].values
c22 = sioma_df['PORVL18N4'].values
c23 = sioma_df['PORVL21N1'].values
c24 = sioma_df['PORVL21N2'].values
c25 = sioma_df['PORVL21N3'].values
c26 = sioma_df['PORVL21N4'].values
c27 = sioma_df['PORVL21N5'].values
c28 = sioma_df['PORVL24N1'].values
c29 = sioma_df['PORVL24N2'].values
c30 = sioma_df['E'].values
c31 = sioma_df['N'].values
c32 = sioma_df['NE'].values
c33 = sioma_df['NO'].values
c34 = sioma_df['O'].values
c35 = sioma_df['S'].values
c36 = sioma_df['SE'].values
c37 = sioma_df['SO'].values
""" Generate the X and Y coordinate arrays from the c1 to c36
variables above. zip groups the values of the Ci arrays row by row;
the result is stored in a list and converted to arrays X and Y
"""
X = np.array(list(zip(c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18)))
print( " ARRAY X" +'\n', X, '\n' )
Y = np.array(list(zip(c19,c20,c21,c22,c23,c24,c25,c26,c27,c28,c29,c30,c31,c32,c33,c34,c35,c36,)))
print( " ARRAY Y" +'\n', Y, '\n' )
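The same kind of arrays can probably be built without one variable per column, by selecting lists of columns directly. A minimal sketch, using a small made-up dataframe (the column names `a` to `d` are placeholders for the real ones):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9], 'd': [10, 11, 12]})

# zip-based construction, in the style of the code above
X_zip = np.array(list(zip(df['a'].values, df['b'].values)))

# Column selection that yields the same 2-D array
X_sel = df[['a', 'b']].to_numpy()
Y_sel = df[['c', 'd']].to_numpy()

print(X_sel.shape, Y_sel.shape)       # (3, 2) (3, 2)
print(np.array_equal(X_zip, X_sel))   # True
```

Either way, the result has one row per data point and one column per selected feature.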
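Then I generated the pairs of x, y centroid coordinates.
I want to start with the assignment stage, in which I assign each data point to its nearest centroid. I have the following: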
def assignment(df, centroids):
    # Iterate over the keys of the k=9 centroids
    for i in centroids.keys():
        # sqrt((x1 - x2)^2 + (y1 - y2)^2)
        # Create a new column in the sioma_df dataframe named
        # distance_from_i
        sioma_df['distance_from_{}'.format(i)] = (
            # Calculate the distance between each data point and
            # each one of the 9 centroids.
            # The distance_from_i column will hold the distance of
            # each data point to centroid i (9 centroids in total)
            np.sqrt(
                (X - centroids[i][0]) ** 2
                + (Y - centroids[i][1]) ** 2
            )
        )
    # Collect the distance columns so that, for each data point, the
    # distances to all centroids can be compared to find the closest one
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    # Create the closest column in the sioma_df dataframe by
    # selecting the column with the minimum value along axis=1:
    sioma_df['closest'] = sioma_df.loc[:, centroid_distance_cols].idxmin(axis=1)
    sioma_df['closest'] = sioma_df['closest'].map(lambda x: int(x.lstrip('distance_from_')))
    sioma_df['color'] = sioma_df['closest'].map(lambda x: colmap[x])
    return df
# Execute the assignment function, which computes which centroid is closest to each data point
df = assignment(sioma_df, centroids)
print(df.head)
But when I run my code, I get the following error:
KeyError: 'distance_from_1'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-160-b96e0351c13d> in <module>()
24
25 #
---> 26 df = assignment(sioma_df, centroids)
27 print(df.head)
<ipython-input-160-b96e0351c13d> in assignment(df, centroids)
11 np.sqrt(
12 (X - centroids[i][0]) ** 2
---> 13 + (Y - centroids[i][1]) ** 2
14 )
15 )
ValueError: Wrong number of items passed 18, placement implies 1
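This indicates that too many values are being put into too few slots. In this case, it is the value on the right-hand side of the assignment: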
sioma_df['distance_from'] = np.sqrt((X - centroids[i][0]) ** 2 + (Y - centroids[i][1]) ** 2)
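I really don't understand how to make this assignment work correctly, which makes it hard for me to troubleshoot.
Any help pointing me in the right direction would be highly appreciated.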
Answer 0 (score: 0)
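My problem was that the np.sqrt(…) statement did not return a one-dimensional array.
Because of the shape of the X and Y numpy arrays, each (row, column) position of the new column needs a single value, but it was receiving an array of 18 elements.
Operations on numpy arrays are element-wise, so they do not change the shape of the arrays being operated on.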
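A small check of that shape behaviour, using zero-filled arrays as hypothetical stand-ins for X and Y (with fewer rows) and the made-up centroid coordinates 3 and 4:

```python
import numpy as np

# Stand-ins for X and Y: 5 rows, 18 columns each
X_demo = np.zeros((5, 18))
Y_demo = np.zeros((5, 18))

# Subtracting a scalar and squaring are element-wise, so the shape is kept
dist = np.sqrt((X_demo - 3) ** 2 + (Y_demo - 4) ** 2)
print(dist.shape)   # (5, 18): one value per element, not one per row
print(dist[0, 0])   # 5.0
```

A dataframe column expects one value per row, that is, shape (5,) in this demo, which is why an 18-column result does not fit.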
So, when I wanted to create the new distance_from_i column and performed this operation:
sioma_df['distance_from_{}'.format(i)] = (
    np.sqrt(
        (X - centroids[i][0]) ** 2
        + (Y - centroids[i][1]) ** 2
    )
)
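I was assigning to the distance_from_i column something other than the one-dimensional array it is able to accept. Instead, each (row, col) position of my distance_from_i column received an array of 18 elements, and that is the cause of the error: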
ValueError: Wrong number of items passed 18, placement implies 1
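So I initialized the new distance_from_i column to NaN values and afterwards assigned it the result of the np.sqrt(…) statement, and with that it worked for me. My assignment function now works and ended up like this: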
def assignment(df, centroids):
    # Iterate over the keys of the k=9 centroids
    for i in centroids.keys():
        # sqrt((x1 - x2)^2 + (y1 - y2)^2)
        # Calculate the distance between each data point and
        # each one of the 9 centroids.
        # The distance_from_i column will hold the distance of
        # each data point to centroid i (9 centroids in total)
        n = np.sqrt(
            (X - centroids[i][0]) ** 2
            + (Y - centroids[i][1]) ** 2
        )
        # Create a new column in the sioma_df dataframe named
        # distance_from_i, initialized to NaN before the assignment
        sioma_df['distance_from_{}'.format(i)] = np.nan
        sioma_df['distance_from_{}'.format(i)] = n
    # Collect the distance columns so that, for each data point, the
    # distances to all centroids can be compared to find the closest one
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    # Create the closest column in the sioma_df dataframe by
    # selecting the column with the minimum value along axis=1
    sioma_df['closest'] = sioma_df.loc[:, centroid_distance_cols].idxmin(axis=1)
    sioma_df['closest'] = sioma_df['closest'].map(lambda x: int(x.lstrip('distance_from_')))
    sioma_df['color'] = sioma_df['closest'].map(lambda x: colmap[x])
    return df
# Execute the assignment function, which computes which centroid is closest to each data point
df = assignment(sioma_df, centroids)
print(df.head())
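For comparison, a distance column that is one-dimensional by construction can be sketched by treating each centroid as a full feature vector and reducing over the feature axis. This is an alternative formulation, not the function above; `assign_closest`, `points`, and `centers` are hypothetical names, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def assign_closest(points, centers):
    """Return the (n_points, n_centers) distance matrix and, for each point,
    the index of its nearest center (hypothetical helper)."""
    # Broadcasting gives shape (n_points, n_centers, n_features)
    diffs = points[:, None, :] - centers[None, :, :]
    # Euclidean norm over the feature axis: shape (n_points, n_centers)
    dists = np.linalg.norm(diffs, axis=2)
    return dists, dists.argmin(axis=1)

rng = np.random.default_rng(0)
points = rng.random((20, 4))    # 20 synthetic points with 4 features
centers = points[:3]            # the first three points serve as centers

dists, closest = assign_closest(points, centers)

# Each distance_from_i column now holds exactly one value per row
out = pd.DataFrame(dists, columns=['distance_from_{}'.format(i + 1) for i in range(3)])
out['closest'] = closest + 1    # 1-based labels, as in the post
print(out.head())
```

Because the norm collapses the feature axis, every column of `dists` is a plain 1-D array of length n_points, so it can be assigned to a dataframe column directly.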