Question

For a recommender system, I need to compute the cosine similarity between all the columns of an entire Spark DataFrame.

In Pandas, I used to do this:

import sklearn.metrics as metrics
import pandas as pd

df= pd.DataFrame(...some dataframe over here :D ...)
metrics.pairwise.cosine_similarity(df.T,df.T)

This generates the similarity matrix between the columns (since I used the transpose).

Is there any way to do the same thing in Spark (Python)?

(I need to apply this to a matrix made up of tens of millions of rows and thousands of columns, which is why I need to do it in Spark.)

Answer 1

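You can use the built-in columnSimilarities() method on a RowMatrix. It can either calculate the exact cosine similarities or estimate them using the DIMSUM method, which is considerably faster for larger datasets. The difference in usage is that for the latter you have to specify a threshold.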

Here is a small reproducible example:

from pyspark.mllib.linalg.distributed import RowMatrix

# Assumes an active SparkContext named sc, as in the pyspark shell
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])

# Convert to RowMatrix
mat = RowMatrix(rows)

# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)

# Output
exact.entries.collect()
[MatrixEntry(0, 2, 0.991935352214),
 MatrixEntry(1, 2, 0.998441152599),
 MatrixEntry(0, 1, 0.997463284056)]
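As a sanity check on what these entries mean, the same numbers can be reproduced locally without Spark. The following is only a sketch using NumPy on the same 4x3 toy matrix; the names X, col_norms, sim and upper are chosen here for illustration and are not part of the Spark API. It normalizes each column to unit length and takes all pairwise dot products, then keeps only the pairs with i < j, since columnSimilarities() returns an upper-triangular matrix.

```python
import numpy as np

# Same 4x3 matrix as in the RowMatrix example (one tuple per row).
X = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)], dtype=float)

# Scale every column to unit length; the dot products of the scaled
# columns are then exactly the pairwise cosine similarities.
col_norms = np.linalg.norm(X, axis=0)
X_unit = X / col_norms
sim = X_unit.T @ X_unit  # full symmetric 3x3 similarity matrix

# Keep only the upper triangle (i < j), mirroring the entries Spark reports.
n = X.shape[1]
upper = {(i, j): sim[i, j] for i in range(n) for j in range(i + 1, n)}

for (i, j), value in sorted(upper.items()):
    print(i, j, round(value, 6))
```

The values for the pairs (0, 1), (0, 2) and (1, 2) should agree with the MatrixEntry values listed above to within floating-point rounding. This only makes sense for small matrices that fit in memory; for the data sizes in the question, the RowMatrix approach is the one to use.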

Apache Spark Python Cosine Similarity over DataFrames