How do I compute the cosine similarity of TF-IDF vectors in Spark?

Time: 2019-02-12 11:44:41

Tags: python apache-spark tf-idf cosine-similarity cosine

I want to compute the cosine similarity between TF-IDF vectors in Spark. This is the code from the Spark tutorial.

from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

spark = SparkSession.builder.master('local').appName('tfdif').getOrCreate()

sentenceData = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
# alternatively, CountVectorizer can also be used to get term frequency vectors

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
for features_label in rescaledData.select("features", "label").take(3):
    print(features_label)

How should I go about this?
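For reference, the quantity being asked about is cos(u, v) = (u · v) / (‖u‖ ‖v‖). The following is only a minimal sketch of that formula in plain Python, not a Spark API: the helper names `cosine_similarity` and `pairwise_cosine` and the `toy_vectors` data are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here. It assumes the vectors are small enough to collect to the driver, for example with `[row["features"].toArray() for row in rescaledData.select("features").collect()]`.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: an all-zero vector
        # is treated as having similarity 0.0 with everything.
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_cosine(vectors):
    """Return {(i, j): similarity} for every pair of indices i < j."""
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


# Hypothetical stand-ins for the TF-IDF vectors in the "features" column.
toy_vectors = [
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],
    [0.0, 3.0, 0.0],
]

similarities = pairwise_cosine(toy_vectors)
for pair, value in sorted(similarities.items()):
    print(pair, value)
```

Collecting every vector to the driver only suits small datasets like the three sentences here. The vectors in the `features` column are `pyspark.ml.linalg` vectors, which also provide `dot` and `norm` methods, so the same formula could be applied to them directly.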

0 answers:

No answers yet