My dataframe looks like this:
schema = ['name','text']
rdd = sc.parallelize(["abc,xyz a","abc,xyz a","abc,xyz b","att,xcy b","att,vwa c","acy,xyz a","acy,xyz a"]) \
.map(lambda x : x.split(","))
df = sqlContext.createDataFrame(rdd,schema)
df.show()
+----+-----+
|name| text|
+----+-----+
| abc|xyz a|
| abc|xyz a|
| abc|xyz b|
| att|xcy b|
| att|vwa c|
| acy|xyz a|
| acy|xyz a|
+----+-----+
I want to see, for each name that appears more than once, how similar its text values are to each other.
So, something like this (the similarity scores here are only rough guesses):
+----+-----------------+
|name| avg_text_jac_sim|
+----+-----------------+
| abc| 0.66 |
| att| 0.00 |
| acy| 1.00 |
+----+-----------------+
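To make the metric concrete (again, the numbers above are just rough guesses), this plain-Python sketch shows what I mean by the average pairwise Jaccard similarity of the token sets, using the rows of name "abc" as an example:

from itertools import combinations

texts = ["xyz a", "xyz a", "xyz b"]            # the three text values for name "abc"
token_sets = [set(t.split()) for t in texts]   # [{'xyz','a'}, {'xyz','a'}, {'xyz','b'}]

def jaccard_sim(a, b):
    return len(a & b) / len(a | b)

pair_sims = [jaccard_sim(a, b) for a, b in combinations(token_sets, 2)]
avg_sim = sum(pair_sims) / len(pair_sims)      # (1.0 + 1/3 + 1/3) / 3 ≈ 0.56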
I computed LSH hashes for each text following this example:
http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing
from pyspark.ml.feature import HashingTF, Tokenizer, MinHashLSH

tokenizer = Tokenizer(inputCol="text", outputCol="words")
wordsData = tokenizer.transform(df)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
hashedTFData = hashingTF.transform(wordsData)

mh = MinHashLSH(inputCol="rawFeatures", outputCol="hashes", numHashTables=5)
model = mh.fit(hashedTFData)
hashedDF = model.transform(hashedTFData)
hashedDF.head(1)
Now, using the model, I can find similar texts by their Jaccard distance, as in the docs example:
# Docs example: dfA and dfB are the two example datasets from that page.
# Compute the locality sensitive hashes for the input rows, then perform approximate
# similarity join.
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
from pyspark.sql.functions import col

print("Approximately joining dfA and dfB on distance smaller than 0.6:")
model.approxSimilarityJoin(dfA, dfB, 0.6, distCol="JaccardDistance")\
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("JaccardDistance")).show()
However, I'm stuck on how to get the actual JaccardDistance between the text rows that share the same name, so that I can turn it into a Jaccard similarity and then average it per name.
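What I imagine (an untested sketch, and I'm not sure it's the right approach) is a self-join of hashedDF with itself via approxSimilarityJoin, restricted to pairs that share the same name, followed by an average of (1 - JaccardDistance) per name. Two things worry me about it: the join is approximate, and with any threshold it drops pairs whose distance is exactly 1.0 (like the two texts of "att"), so those names would vanish from the result:

from pyspark.sql.functions import col, avg, monotonically_increasing_id

# give every row an id so self-matches and duplicate (A,B)/(B,A) pairs can be dropped
hashedDF = hashedDF.withColumn("id", monotonically_increasing_id())

# self-join; the threshold of 1.0 keeps pairs with distance < 1.0,
# so texts with no tokens in common are still lost
pairs = model.approxSimilarityJoin(hashedDF, hashedDF, 1.0, distCol="JaccardDistance") \
    .filter(col("datasetA.name") == col("datasetB.name")) \
    .filter(col("datasetA.id") < col("datasetB.id"))

pairs.groupBy(col("datasetA.name").alias("name")) \
    .agg(avg(1 - col("JaccardDistance")).alias("avg_text_jac_sim")) \
    .show()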