Question

I have the following sioma_df dataframe:

These are the shape and the column index of sioma_df. It has 13807 rows and 37 columns:

sioma_df.shape
(13807, 37)
sioma_df.columns
Index(['Luz (lux)', 'Precipitación (ml)', 'Temperatura (°C)',
       'Velocidad del Viento (km/h)', 'E', 'N', 'NE', 'NO', 'O', 'S', 'SE',
       'SO', 'PORVL2N1', 'PORVL2N2', 'PORVL4N1', 'PORVL5N1', 'PORVL6N1',
       'PORVL7N1', 'PORVL8N1', 'PORVL9N1', 'PORVL10N1', 'PORVL13N1',
       'PORVL14N1', 'PORVL15N1', 'PORVL16N1', 'PORVL16N2', 'PORVL18N1',
       'PORVL18N2', 'PORVL18N3', 'PORVL18N4', 'PORVL21N1', 'PORVL21N2',
       'PORVL21N3', 'PORVL21N4', 'PORVL21N5', 'PORVL24N1', 'PORVL24N2'],
      dtype='object')

I want to apply the k-means algorithm, and I decided that in the random initialization phase I will have k=9 centroids:

import numpy as np

# Turn the dataframe into a NumPy array (get_values() was removed in pandas 1.0)
sioma_numpy = sioma_df.to_numpy()

k=9

# Create a dictionary with the centroids coordinates 
centroids = {
    i + 1: [np.random.randint(0, np.max(sioma_numpy)), np.random.randint(0, np.max(sioma_numpy))]
    for i in range(k)
}
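The dictionary above gives every centroid two coordinates. As a rough sketch of a different initialization (an assumption, not what the code above does), each centroid could instead get one coordinate per feature, drawn within that feature's observed range. The names toy_data and toy_centroids are made up for illustration and stand in for sioma_numpy and centroids:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for sioma_numpy: 100 rows, 5 features
toy_data = rng.random((100, 5)) * 10

k = 9
# Observed range of every feature (column)
low = toy_data.min(axis=0)
high = toy_data.max(axis=0)

# One centroid per key; each centroid is a 1-D array with one value per feature
toy_centroids = {
    i + 1: rng.uniform(low, high)
    for i in range(k)
}

print(len(toy_centroids), toy_centroids[1].shape)  # 9 (5,)
```

With centroids shaped like this, a point and a centroid have the same number of coordinates, so a single distance per point is well defined.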

Before applying the clustering, I plot the data:

# I get each column individually into an array 

c1 = sioma_df['Luz (lux)'].values
c2 = sioma_df['Precipitación (ml)'].values
c3 = sioma_df['Temperatura (°C)'].values
c4 = sioma_df['Velocidad del Viento (km/h)'].values
c5 = sioma_df['PORVL2N1'].values
c6 = sioma_df['PORVL2N2'].values
c7 = sioma_df['PORVL4N1'].values
c8 = sioma_df['PORVL5N1'].values
c9 = sioma_df['PORVL6N1'].values
c10 = sioma_df['PORVL7N1'].values
c11 = sioma_df['PORVL8N1'].values
c12 = sioma_df['PORVL9N1'].values
c13 = sioma_df['PORVL10N1'].values
c14 = sioma_df['PORVL13N1'].values
c15 = sioma_df['PORVL14N1'].values
c16 = sioma_df['PORVL15N1'].values
c17 = sioma_df['PORVL16N1'].values
c18 = sioma_df['PORVL16N2'].values
c19 = sioma_df['PORVL18N1'].values
c20 = sioma_df['PORVL18N2'].values
c21 = sioma_df['PORVL18N3'].values
c22 = sioma_df['PORVL18N4'].values
c23 = sioma_df['PORVL21N1'].values
c24 = sioma_df['PORVL21N2'].values
c25 = sioma_df['PORVL21N3'].values
c26 = sioma_df['PORVL21N4'].values
c27 = sioma_df['PORVL21N5'].values
c28 = sioma_df['PORVL24N1'].values
c29 = sioma_df['PORVL24N2'].values
c30 = sioma_df['E'].values
c31 = sioma_df['N'].values
c32 = sioma_df['NE'].values
c33 = sioma_df['NO'].values
c34 = sioma_df['O'].values
c35 = sioma_df['S'].values
c36 = sioma_df['SE'].values
c37 = sioma_df['SO'].values

""" Build the X and Y coordinate arrays from the c1..c36 variables above.
zip() groups the i-th value of every column into one tuple, so X holds
columns c1..c18 and Y holds columns c19..c36, with one row per observation.
"""
X = np.array(list(zip(c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18)))
print( " ARRAY X" +'\n', X, '\n' )
Y = np.array(list(zip(c19,c20,c21,c22,c23,c24,c25,c26,c27,c28,c29,c30,c31,c32,c33,c34,c35,c36,)))
print( " ARRAY Y" +'\n', Y, '\n' )
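As a side note, the same kind of array can probably be built without the per-column variables by selecting a list of columns in one step. The sketch below uses a small made-up frame (toy_df and its column names are hypothetical stand-ins for sioma_df) and checks that the result matches the zip approach:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sioma_df: 5 rows, 4 numeric columns
toy_df = pd.DataFrame(
    np.arange(20, dtype=float).reshape(5, 4),
    columns=["a", "b", "c", "d"],
)

# zip-based construction, as in the code above
X_zip = np.array(list(zip(toy_df["a"].values, toy_df["b"].values)))

# Column-list selection: one call per array, no intermediate variables
x_cols = ["a", "b"]
y_cols = ["c", "d"]
X_toy = toy_df[x_cols].to_numpy()
Y_toy = toy_df[y_cols].to_numpy()

# One row per observation, one column per selected feature
print(X_toy.shape, Y_toy.shape)
print(np.array_equal(X_zip, X_toy))
```

Keeping the column names in a list also makes it harder to repeat or skip a column by accident.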

Then I generated the (x, y) centroid coordinate pairs.

I want to start with the assignment phase, in which I assign each data point to its nearest centroid. I have the following:

def assignment(df, centroids):
    # Iterate over the keys of the k=9 centroids
    for i in centroids.keys():
        # Euclidean distance: sqrt((x1 - x2)^2 + (y1 - y2)^2)
        # I want to create a new column in the sioma_df dataframe named
        # distance_from_i
        sioma_df['distance_from_{}'.format(i)] = (
            # Compute the distance between each data point and
            # each one of the 9 centroids

            # The distance_from_i column should hold the distance from
            # each data point to centroid i (there are 9 in total)
            np.sqrt(
                (X - centroids[i][0]) ** 2
                + (Y - centroids[i][1]) ** 2
            )
        )
    # For each data point, compare its distances to all the centroids
    # to find out which centroid is the closest
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    # Create the closest column in the sioma_df dataframe by selecting,
    # for each row (axis=1), the column that holds the minimum distance:
    sioma_df['closest'] = sioma_df.loc[:, centroid_distance_cols].idxmin(axis=1)
    sioma_df['closest'] = sioma_df['closest'].map(lambda x: int(x.lstrip('distance_from_')))
    sioma_df['color'] = sioma_df['closest'].map(lambda x: colmap[x])
    return df

# Run the assignment function, which computes the closest centroid for each data point
df = assignment(sioma_df, centroids)
print(df.head)

But when I run my code, I get the following error:

KeyError: 'distance_from_1'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-160-b96e0351c13d> in <module>()
     24 
     25 # 
---> 26 df = assignment(sioma_df, centroids)
     27 print(df.head)

<ipython-input-160-b96e0351c13d> in assignment(df, centroids)
     11             np.sqrt(
     12                 (X - centroids[i][0]) ** 2
---> 13                 + (Y - centroids[i][1]) ** 2
     14             )
     15         )


ValueError: Wrong number of items passed 18, placement implies 1

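This indicates that you are trying to put too much data into too few slots. Here that applies to the value on the right-hand side of this assignment: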

sioma_df['distance_from'] = np.sqrt((X - centroids[i][0]) ** 2 + (Y - centroids[i][1]) ** 2)

I really don't understand how to fix this so that the assignment is done correctly, which makes it hard for me to troubleshoot.

Any pointer in the right direction would be highly appreciated.

Answer 1

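My problem was that the np.sqrt(…) expression did not return a one-dimensional array. Because the X and Y NumPy arrays each have 18 columns, every (row, column) position of the new column needs a single value, but it was receiving an array of 18 elements.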

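Operations on NumPy arrays are element-wise, so they do not change the shape of the arrays being operated on. So when I tried to create the new distance_from_i column with this operation: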

sioma_df['distance_from_{}'.format(i)] = (
            np.sqrt(
                (X - centroids[i][0]) ** 2
                + (Y - centroids[i][1]) ** 2
            )
        )

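I was assigning something that is not a one-dimensional array to the distance_from_i column, and a one-dimensional array is what a single column can accept. Instead, every (row, column) cell of distance_from_i received an array of 18 elements, and that is the cause of the error: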

ValueError: Wrong number of items passed 18, placement implies 1
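A small illustration of the shape issue, using toy arrays rather than the real data (X_toy, Y_toy, cx and cy are made-up stand-ins for X, Y and one centroid's two coordinates). It only shows that element-wise arithmetic keeps the 18 columns, while a reduction along axis=1 yields one value per row:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for X and Y: 6 rows, 18 columns each
X_toy = rng.random((6, 18))
Y_toy = rng.random((6, 18))
cx, cy = 3, 7  # hypothetical scalar centroid coordinates

# Element-wise arithmetic keeps the (rows, 18) shape
elementwise = np.sqrt((X_toy - cx) ** 2 + (Y_toy - cy) ** 2)

# Summing along axis=1 collapses the 18 columns into one value per row
per_row = np.sqrt(((X_toy - cx) ** 2 + (Y_toy - cy) ** 2).sum(axis=1))

print(elementwise.shape)  # (6, 18)
print(per_row.shape)      # (6,)
```

Only the second result has the one-dimensional shape that a single dataframe column expects.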

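So I first initialized the new distance_from_i column to NaN values and then assigned it the result of the np.sqrt(…) expression, and it works. My assignment function now runs OK and ended up like this: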

def assignment(df, centroids):
    # Iterate over the keys of the k=9 centroids
    for i in centroids.keys():
        # Euclidean distance: sqrt((x1 - x2)^2 + (y1 - y2)^2)
        # Compute the distance between each data point and
        # each one of the 9 centroids

        # The distance_from_i column will hold the distance from
        # each data point to centroid i (there are 9 in total)
        n = np.sqrt(
                (X - centroids[i][0]) ** 2
                + (Y - centroids[i][1]) ** 2
        )
        # Create the new distance_from_i column in the dataframe,
        # initialized to NaN, and then assign the computed values to it
        df['distance_from_{}'.format(i)] = np.nan
        df['distance_from_{}'.format(i)] = n

    # For each data point, compare its distances to all the centroids
    # to find out which centroid is the closest
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]

    # Create the closest column in the dataframe by selecting,
    # for each row (axis=1), the column that holds the minimum distance
    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
    # Slice off the prefix; lstrip() removes a set of characters, not a prefix
    df['closest'] = df['closest'].map(lambda x: int(x[len('distance_from_'):]))
    # colmap is assumed to be a dict {centroid key: color} defined elsewhere
    df['color'] = df['closest'].map(lambda x: colmap[x])
    return df

# Run the assignment function, which computes the closest centroid for each data point
df = assignment(sioma_df, centroids)
print(df.head())
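An alternative, offered only as a sketch and not as the code from this answer: treat every row as one point in feature space, give every centroid the same number of coordinates as there are features, and reduce over the feature axis so that each distance column is one-dimensional by construction. The function name assign_closest, the toy frame and the toy centroids are all hypothetical:

```python
import numpy as np
import pandas as pd

def assign_closest(df, feature_cols, centroids):
    """Return a copy of df with distance_from_<label> and closest columns.

    centroids: dict mapping label -> sequence with len(feature_cols) values.
    """
    out = df.copy()
    points = out[feature_cols].to_numpy(dtype=float)                  # (n, d)
    labels = list(centroids.keys())
    centers = np.array([centroids[k] for k in labels], dtype=float)   # (k, d)

    # Broadcasting: (n, 1, d) - (1, k, d) -> (n, k, d); sum over d -> (n, k)
    diffs = points[:, None, :] - centers[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    for j, label in enumerate(labels):
        # Each slice dists[:, j] is 1-D, so it fits a single column
        out['distance_from_{}'.format(label)] = dists[:, j]
    # Label of the nearest centroid for every row
    out['closest'] = np.array(labels)[dists.argmin(axis=1)]
    return out

toy_df = pd.DataFrame({'f1': [0.0, 0.1, 5.0, 5.2], 'f2': [0.0, 0.2, 5.0, 4.9]})
toy_centroids = {1: [0.0, 0.0], 2: [5.0, 5.0]}
result = assign_closest(toy_df, ['f1', 'f2'], toy_centroids)
print(result['closest'].tolist())  # [1, 1, 2, 2]
```

Because the nearest label comes straight from argmin, this variant does not need to parse the centroid number back out of the column name.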

Wrong number of items passed - adding NumPy array content to a dataframe column
