How to compute a Euclidean distance matrix in PySpark?

Asked: 2019-05-23 11:45:04

Tags: apache-spark pyspark bigdata distributed-computing distance-matrix

I have a dataset with 100,000 records and need to compute its Euclidean distance matrix, which would be a 100,000 × 100,000 matrix. In plain Python I would use `squareform(pdist(x))`, but that function cannot be applied to an RDD. How can I do this on the Spark platform in Python?
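For reference, here is a minimal sketch of what `squareform(pdist(x))` produces locally on a toy 3-row array (a stand-in for the real 100,000-row dataset); this is the output the question is trying to reproduce at scale:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data: 3 points in 2-D.
x = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [0.0, 8.0]])

# pdist returns the condensed upper-triangle distances;
# squareform expands them into the full symmetric matrix.
D = squareform(pdist(x))
print(D)        # D[i, j] is the Euclidean distance between rows i and j
print(D.shape)  # (3, 3); the real dataset would need a 100000 x 100000 matrix
```

Note that at n = 100,000 this matrix has 10^10 entries, so even if the computation were distributed, materializing the full result on one machine is rarely feasible.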

from pyspark import SparkContext
import numpy as np
from scipy.spatial.distance import pdist, squareform

sc = SparkContext.getOrCreate()

loc = "/usr/lib/spark/examples/src/main/python/Major_proj/datasets/bcell1.csv"
dat = sc.textFile(loc)                           # load the dataset as an RDD of lines
header = dat.first()                             # extract the header line
dataRDD = dat.filter(lambda row: row != header)  # filter out the header

csv_rdd = dataRDD.map(lambda row: row.split(","))  # split each row on commas
print(csv_rdd.take(5))                             # print the first 5 rows
y = csv_rdd.map(lambda x: np.array(x, dtype=np.float32))  # convert strings to float arrays

# This line fails: pdist expects an in-memory array-like, not an RDD.
D = squareform(pdist(y))
print("distance matrix is:", D, "with shape:", D.shape)
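Since `pdist` cannot consume an RDD, one common workaround (a sketch, not the only approach) is to index the rows with `zipWithIndex` and form all pairs with `cartesian`, computing each distance in a `map`. The per-pair function below is plain NumPy so it can run standalone for illustration; the commented lines show where the actual PySpark calls would go. The list comprehension mimics what `cartesian` emits.

```python
import numpy as np

def pair_dist(pair):
    """Distance for one ((i, vi), (j, vj)) pair, as emitted by rdd.cartesian(rdd)."""
    (i, vi), (j, vj) = pair
    return (i, j, float(np.sqrt(np.sum((vi - vj) ** 2))))

# In PySpark, the same function would be applied distributedly, e.g.:
#   indexed = y.zipWithIndex().map(lambda t: (t[1], t[0]))  # (index, vector) pairs
#   dist_rdd = indexed.cartesian(indexed).map(pair_dist)    # (i, j, distance) triples
# Caveat: cartesian on 100,000 rows yields 10^10 pairs; filtering to i < j halves
# the work, and collecting the full matrix to the driver is rarely feasible.

# Local demonstration with toy vectors:
vecs = [np.array([0.0, 0.0]), np.array([3.0, 4.0])]
indexed = list(enumerate(vecs))                     # (index, vector) pairs
pairs = [(a, b) for a in indexed for b in indexed]  # mimics cartesian
dists = [pair_dist(p) for p in pairs]
print(dists)  # includes (0, 1, 5.0)
```

Keeping the result as an RDD of `(i, j, distance)` triples (and writing it out partition by partition) avoids ever holding the full matrix in one machine's memory.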

0 Answers:

There are no answers yet.