How to compute a Euclidean distance matrix in PySpark?

Asked: 2019-05-23 11:45:04

Tags: apache-spark pyspark bigdata distributed-computing distance-matrix

I have a dataset with 100,000 records and need to compute its Euclidean distance matrix, which would be a 100,000 × 100,000 matrix. In plain Python I would use `squareform(pdist(x))`, but that function cannot be applied to an RDD. How can I do this on the Spark platform in Python?
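For reference, here is a minimal sketch of what `squareform(pdist(x))` produces locally on a toy 3-row array (a stand-in for the real 100,000-row dataset); this is the output the question is trying to reproduce at scale:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data: 3 points in 2-D.
x = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [0.0, 8.0]])

# pdist returns the condensed upper-triangle distances;
# squareform expands them into the full symmetric matrix.
D = squareform(pdist(x))
print(D)        # D[i, j] is the Euclidean distance between rows i and j
print(D.shape)  # (3, 3); the real dataset would need a 100000 x 100000 matrix
```

Note that at n = 100,000 this matrix has 10^10 entries, so even if the computation were distributed, materializing the full result on one machine is rarely feasible.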

from pyspark import SparkContext
import numpy as np
from scipy.spatial.distance import pdist, squareform

sc = SparkContext.getOrCreate()

loc = "/usr/lib/spark/examples/src/main/python/Major_proj/datasets/bcell1.csv"
dat = sc.textFile(loc)                           # load the dataset as an RDD of lines
header = dat.first()                             # extract the header line
dataRDD = dat.filter(lambda row: row != header)  # filter out the header

csv_rdd = dataRDD.map(lambda row: row.split(","))  # split each row on commas
print(csv_rdd.take(5))                             # print the first 5 rows
y = csv_rdd.map(lambda x: np.array(x, dtype=np.float32))  # convert strings to float arrays

# This line fails: pdist expects an in-memory array-like, not an RDD.
D = squareform(pdist(y))
print("distance matrix is:", D, "with shape:", D.shape)
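Since `pdist` cannot consume an RDD, one common workaround (a sketch, not the only approach) is to index the rows with `zipWithIndex` and form all pairs with `cartesian`, computing each distance in a `map`. The per-pair function below is plain NumPy so it can run standalone for illustration; the commented lines show where the actual PySpark calls would go. The list comprehension mimics what `cartesian` emits.

```python
import numpy as np

def pair_dist(pair):
    """Distance for one ((i, vi), (j, vj)) pair, as emitted by rdd.cartesian(rdd)."""
    (i, vi), (j, vj) = pair
    return (i, j, float(np.sqrt(np.sum((vi - vj) ** 2))))

# In PySpark, the same function would be applied distributedly, e.g.:
#   indexed = y.zipWithIndex().map(lambda t: (t[1], t[0]))  # (index, vector) pairs
#   dist_rdd = indexed.cartesian(indexed).map(pair_dist)    # (i, j, distance) triples
# Caveat: cartesian on 100,000 rows yields 10^10 pairs; filtering to i < j halves
# the work, and collecting the full matrix to the driver is rarely feasible.

# Local demonstration with toy vectors:
vecs = [np.array([0.0, 0.0]), np.array([3.0, 4.0])]
indexed = list(enumerate(vecs))                     # (index, vector) pairs
pairs = [(a, b) for a in indexed for b in indexed]  # mimics cartesian
dists = [pair_dist(p) for p in pairs]
print(dists)  # includes (0, 1, 5.0)
```

Keeping the result as an RDD of `(i, j, distance)` triples (and writing it out partition by partition) avoids ever holding the full matrix in one machine's memory.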

0 Answers:

There are no answers yet.