Jaccard similarity in PySpark 2.2

Date: 2018-05-15 18:15:56

Tags: python pyspark

I am trying to compute Jaccard similarity using the MinHashLSH technique described in Spark MLlib. I have a DataFrame of users and items. The similarity score comes out as zero and I am getting incorrect results. What am I doing wrong?

from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.linalg import SparseVector, DenseVector
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors 
from pyspark.sql import Row 
from pyspark.ml.feature import VectorAssembler

df = sc.parallelize([
    Row(CUST_ID=1, ITEM_ID=1),
    Row(CUST_ID=1, ITEM_ID=2),
    Row(CUST_ID=2, ITEM_ID=1),
    Row(CUST_ID=2, ITEM_ID=2),
    Row(CUST_ID=2, ITEM_ID=3)
]).toDF()

dfpivot = df.groupBy("CUST_ID").pivot("ITEM_ID").count().na.fill(0)


input_cols = [x for x in dfpivot.columns if x !="CUST_ID"]

dfassembler = (VectorAssembler(inputCols=input_cols, outputCol="features")
    .transform(dfpivot)
    .select("CUST_ID", "features"))

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=3)
model = mh.fit(dfassembler)

# Feature Transformation
print("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfassembler).show(3, False)

dfA=dfassembler
dfB=dfassembler

print("Approximately joining dfA and dfB on distance smaller than 0.6:")
model.approxSimilarityJoin(dfA, dfB, 0.3, distCol="JaccardDistance")\
.select(col("datasetA.CUST_ID").alias("idA"),
        col("datasetB.CUST_ID").alias("idB"),
        col("JaccardDistance")).show()

Approximately joining dfA and dfB on distance smaller than 0.6:

+---+---+---------------+
|idA|idB|JaccardDistance|
+---+---+---------------+
|  1|  1|            0.0|
|  2|  2|            0.0|
+---+---+---------------+
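As a sanity check of this output, the Jaccard distances in the toy data can be computed by hand in plain Python (no Spark; the `jaccard_distance` helper below is only an illustration of the measure MinHashLSH approximates). Customer 1 bought items {1, 2} and customer 2 bought {1, 2, 3}, so their cross-pair distance is 1 - 2/3 ≈ 0.333, which is above the 0.3 threshold passed to `approxSimilarityJoin`; that is why only the self-pairs survive the join:

```python
# Pure-Python hand check of the toy data (illustrative helper, not part of Spark).
items = {1: {1, 2}, 2: {1, 2, 3}}  # CUST_ID -> set of ITEM_IDs

def jaccard_distance(a, b):
    """1 - |intersection| / |union|: the distance MinHashLSH approximates."""
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance(items[1], items[1]))  # 0.0 -- a self pair
print(jaccard_distance(items[1], items[2]))  # 0.333... -- above the 0.3 threshold
```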

1 Answer:

Answer 0 (score: 0)

Actually, JaccardDistance is a distance score; the similarity score is 1 - JaccardDistance. In your case, idA and idB are the same pair, so the similarity is 1 and JaccardDistance = 0.
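To turn the join output into similarity scores for distinct customers, one option (a sketch, assuming the join threshold is raised above the cross-pair distance of 1/3 and self-matches are dropped, e.g. with a filter on `datasetA.CUST_ID < datasetB.CUST_ID`; in Spark the conversion itself would be something like `withColumn("sim", 1 - col("JaccardDistance"))`) is simply the complement of the distance. Checked here in plain Python on the toy pairs, with the distance values taken from the hand computation on this data:

```python
# (idA, idB, JaccardDistance) rows as the widened join would return them.
pairs = [(1, 1, 0.0), (2, 2, 0.0), (1, 2, 1 - 2 / 3)]

# Similarity is the complement of the distance.
similarities = [(a, b, 1 - d) for a, b, d in pairs]
for a, b, s in similarities:
    print(a, b, round(s, 3))  # self pairs -> 1.0; cross pair -> 0.667
```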