I have a dataset with 100,000 records and I need to compute its Euclidean distance matrix, which should be a 100,000 x 100,000 matrix. In plain Python we have squareform(pdist(x)), but I cannot apply that function to an RDD. How can I do this on the Spark platform in Python?
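For reference, this is the local SciPy idiom I mean, on a toy array (the real data has 100,000 rows):

import numpy as np
from scipy.spatial.distance import pdist, squareform

x = np.random.rand(5, 3)    # 5 points in 3 dimensions
D = squareform(pdist(x))    # 5 x 5 symmetric Euclidean distance matrix
print(D.shape)              # (5, 5)

Here is what I have so far on Spark: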
import numpy as np
from scipy.spatial.distance import pdist, squareform

loc = "/usr/lib/spark/examples/src/main/python/Major_proj/datasets/bcell1.csv"
dat = sc.textFile(loc)                                    # load the dataset into an RDD
header = dat.first()                                      # extract the header line
dataRDD = dat.filter(lambda row: row != header)           # filter out the header
csv_rdd = dataRDD.map(lambda row: row.split(","))         # split each row into fields
print(csv_rdd.take(5))                                    # print the first 5 rows
y = csv_rdd.map(lambda x: np.array(x, dtype=np.float32))  # convert string fields to float vectors
X = np.array(y.collect())   # collect to the driver; pdist() needs a local ndarray, not an RDD
D = squareform(pdist(X))    # Euclidean distance matrix; at 100,000 rows this is ~10^10 entries (tens of GB), so it will not fit in driver memory
print("distance matrix is:", D, "with shape:", D.shape)