Assigning clusters to data points stored in a Spark DataFrame

Date: 2018-04-10 02:21:24

Tags: apache-spark dataframe pyspark spark-dataframe euclidean-distance

I have two Spark DataFrames.

Schema of DataFrame A (stores the cluster centroids):

cluster_id, dim1_pos, dim2_pos, dim3_pos, ..., dimN_pos

Schema of DataFrame B (the data points):

entity_id, dim1_pos, dim2_pos, dim3_pos, ..., dimN_pos

DataFrame A has about 100 rows, which means I have 100 cluster centroids. I need to map each entity in DataFrame B to the closest cluster (in terms of Euclidean distance).

How can I do this?

I want a DataFrame with the schema entity_id, cluster_id as my final result.

2 Answers:

Answer 0 (score: 2)

I ended up using a VectorAssembler to put all the dimX columns' values into a single column (one per DataFrame).

Once that was done, I just used a combination of UDFs to get the answer.

import numpy as np
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType, IntegerType

featureCols = ["dim1_pos", "dim2_pos", ..., "dimN_pos"]
vecAssembler = VectorAssembler(inputCols=featureCols, outputCol="features")
dfA = vecAssembler.transform(dfA)
dfB = vecAssembler.transform(dfB)

def distCalc(a, b):
    # squared Euclidean distance; the square root is not needed for an argmin
    return float(np.sum(np.square(np.array(a) - np.array(b))))

def closestPoint(point_x, centers):
    # caveat: referencing another DataFrame inside a UDF like this
    # will not work on distributed data
    udf_dist = udf(lambda x: distCalc(x, point_x), DoubleType())
    centers = centers.withColumn('distance', udf_dist(centers.features))
    centers.registerTempTable('t1')
    bestIndex = ...  # write a query to get the minimum distance from the centers df
    return bestIndex

udf_closestPoint = udf(lambda x: closestPoint(x, dfA), IntegerType())
dfB = dfB.withColumn('cluster_id', udf_closestPoint(dfB.features))
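The nearest-centroid logic above can be sanity-checked outside Spark with plain NumPy broadcasting; a minimal sketch, where the 3-dimensional toy centroids and points are made up for illustration:

```python
import numpy as np

# hypothetical toy data: 3 centroids and 4 points in 3 dimensions
centroids = np.array([[0.0, 0.0, 0.0],
                      [5.0, 5.0, 5.0],
                      [10.0, 0.0, 0.0]])
points = np.array([[0.1, 0.2, 0.0],
                   [4.8, 5.1, 5.0],
                   [9.5, 0.5, 0.1],
                   [5.2, 4.9, 5.3]])

# squared Euclidean distances via broadcasting: shape (n_points, n_centroids)
dists = np.sum((points[:, None, :] - centroids[None, :, :]) ** 2, axis=2)

# index of the closest centroid for each point
cluster_id = np.argmin(dists, axis=1)
print(cluster_id)  # [0 1 2 1]
```

The square root is omitted for the same reason as in the answer: it does not change which centroid is closest.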

Answer 1 (score: 1)

If the Spark DataFrames are not very large, you can turn them into pandas DataFrames with toPandas() and use scipy.spatial.distance.cdist() (read this for more information).

Sample code:

import pandas as pd
from scipy.spatial.distance import cdist

cluster = pd.DataFrame({'cluster_id': [1, 2, 3, 7],
                        'dim1_pos': [201, 204, 203, 204],
                        'dim2_pos': [55, 40, 84, 31]})
entity = pd.DataFrame({'entity_id': ['A', 'B', 'C'],
                       'dim1_pos': [201, 204, 203],
                       'dim2_pos': [55, 40, 84]})
cluster.set_index('cluster_id', inplace=True)
entity.set_index('entity_id', inplace=True)

result_metric = cdist(cluster, entity, metric='euclidean')

result_df = pd.DataFrame(result_metric, index=cluster.index.values, columns=entity.index.values)
print(result_df)

            A          B          C
1    0.000000  15.297059  29.068884
2   15.297059   0.000000  44.011362
3   29.068884  44.011362   0.000000
7   24.186773   9.000000  53.009433

Then you can use idxmin() to find the minimum pair for each row of the metric, like this:

# get the min. pair for each row
result = pd.DataFrame(result_df.idxmin(axis=1, skipna=True))
# turn the index value into a column
result.reset_index(level=0, inplace=True)
# rename and reorder the columns
result.columns = ['cluster_id', 'entity_id']
result = result.reindex(columns=['entity_id', 'cluster_id'])
print(result)

  entity_id  cluster_id
0         A           1
1         B           2
2         C           3
3         B           7
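Note that idxmin(axis=1) on a (clusters x entities) matrix finds the nearest entity for each cluster, which is why B appears twice above. To get what the question actually asks for (the nearest cluster for each entity), compute the distances the other way around and take the row-wise minimum. A sketch reusing the same toy frames:

```python
import pandas as pd
from scipy.spatial.distance import cdist

cluster = pd.DataFrame({'cluster_id': [1, 2, 3, 7],
                        'dim1_pos': [201, 204, 203, 204],
                        'dim2_pos': [55, 40, 84, 31]})
entity = pd.DataFrame({'entity_id': ['A', 'B', 'C'],
                       'dim1_pos': [201, 204, 203],
                       'dim2_pos': [55, 40, 84]})
cluster.set_index('cluster_id', inplace=True)
entity.set_index('entity_id', inplace=True)

# rows = entities, columns = clusters
dist = pd.DataFrame(cdist(entity, cluster, metric='euclidean'),
                    index=entity.index, columns=cluster.index)

# nearest cluster for each entity -> entity_id, cluster_id
result = dist.idxmin(axis=1).rename('cluster_id').reset_index()
print(result)
```

With this toy data each entity coincides with one centroid, so A, B, and C map to clusters 1, 2, and 3 respectively, with each entity listed exactly once.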