How to get the average Jaccard similarity between rows of a text column using PySpark

Date: 2019-03-26 23:45:59

Tags: apache-spark pyspark apache-spark-mllib

My DataFrame looks like this:

schema = ['name','text']
rdd = sc.parallelize(["abc,xyz a","abc,xyz a","abc,xyz b","att,xcy b","att,vwa c","acy,xyz a","acy,xyz a"]) \
        .map(lambda x : x.split(","))
df = sqlContext.createDataFrame(rdd,schema)
df.show()


+----+-----+
|name| text|
+----+-----+
| abc|xyz a|
| abc|xyz a|
| abc|xyz b|
| att|xcy b|
| att|vwa c|
| acy|xyz a|
| acy|xyz a|
+----+-----+

I want to see, for each name, how similar its duplicate text rows are to one another.

So, something like this (the similarity scores are approximate):

+----+-----------------+
|name| avg_text_jac_sim|
+----+-----------------+
| abc| 0.66            |
| att| 0.00            |
| acy| 1.00            |
+----+-----------------+
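
For clarity, here is how that metric could be worked out by hand for the `abc` group, assuming it means the average Jaccard similarity over all unordered pairs of rows sharing a name; the table above is only a rough target, and under this convention `abc` comes out at about 0.56. This is a plain-Python sanity check, not part of the PySpark pipeline:

    # Sanity check (assumption: "average Jaccard similarity" = mean over all
    # unordered pairs of rows that share the same name).
    from itertools import combinations

    texts = ["xyz a", "xyz a", "xyz b"]          # the three rows for name "abc"
    token_sets = [set(t.split()) for t in texts]

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    sims = [jaccard(a, b) for a, b in combinations(token_sets, 2)]
    print(sum(sims) / len(sims))                 # (1 + 1/3 + 1/3) / 3 ≈ 0.56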

I computed LSH hashes for each text following this example: http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing

    from pyspark.ml.feature import HashingTF, Tokenizer, MinHashLSH

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    wordsData = tokenizer.transform(df)

    hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
    hashedTFData = hashingTF.transform(wordsData)

    mh = MinHashLSH(inputCol="rawFeatures", outputCol="hashes", numHashTables=5)
    model = mh.fit(hashedTFData)

    hashedDF = model.transform(hashedTFData)
    hashedDF.head(1)

Now, using the model, I can find similar texts based on Jaccard distance, as in the documentation's example:

from pyspark.sql.functions import col

# (Example from the Spark docs; dfA and dfB are the documentation's example DataFrames.)
# Compute the locality sensitive hashes for the input rows, then perform approximate
# similarity join.
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
print("Approximately joining dfA and dfB on distance smaller than 0.6:")
model.approxSimilarityJoin(dfA, dfB, 0.6, distCol="JaccardDistance")\
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("JaccardDistance")).show()

However, I'm stuck on how to get the actual JaccardDistance between the duplicate text rows of each name, so that I can compute the Jaccard similarity and then the average.
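
One possible way to finish this, sketched under the assumption that a self-join of `hashedDF` restricted to rows sharing the same `name` is acceptable: join the transformed DataFrame with itself via `approxSimilarityJoin`, keep each unordered pair once, convert the returned `JaccardDistance` to a similarity with `1 - distance`, and average per name. The `row_id` column is added here only to de-duplicate pairs and is not part of the original data. Note that the join is approximate and only returns pairs with distance strictly below the threshold, so pairs with no tokens in common (like those for `att`) may not appear at all and would have to be filled in with 0.0 separately:

    from pyspark.sql import functions as F

    # Hypothetical sketch: add a row id so each unordered pair is counted once
    # and self-pairs are excluded.
    hashedDF = hashedDF.withColumn("row_id", F.monotonically_increasing_id())

    # Self-join on the already-transformed dataset; keep only pairs sharing a name.
    pairs = (model.approxSimilarityJoin(hashedDF, hashedDF, 1.0, distCol="JaccardDistance")
             .filter(F.col("datasetA.name") == F.col("datasetB.name"))
             .filter(F.col("datasetA.row_id") < F.col("datasetB.row_id")))

    # Jaccard similarity = 1 - Jaccard distance; average it per name.
    avg_sim = (pairs
               .groupBy(F.col("datasetA.name").alias("name"))
               .agg(F.avg(1 - F.col("JaccardDistance")).alias("avg_text_jac_sim")))
    avg_sim.show()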
