Multiply an RDD row with all other rows in PySpark

Date: 2018-07-17 10:27:11

Tags: dataframe vector pyspark rdd matrix-multiplication

I have an RDD of DenseVector objects and I want to:

  1. Select one of these vectors (a single row)
  2. Multiply this vector with all of the other vector rows in order to compute a (cosine) similarity, as in the small NumPy illustration below
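
Just to make the target computation concrete, here is a plain NumPy toy example (made-up values, not my data) of what I mean by multiplying one selected row against all rows:

```python
import numpy as np

# Toy illustration of the goal (plain NumPy, not Spark): cosine similarity
# between one selected row and every row of a small dense matrix.
M = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [2.0, 0.0, 1.0]])
v = M[1]  # the selected row

# Normalize every row and the selected vector, then take the dot products.
row_norms = np.linalg.norm(M, axis=1, keepdims=True)
sims = (M / row_norms) @ (v / np.linalg.norm(v))
print(sims)  # the entry for row 1 is 1.0 (the vector against itself)
```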

Basically, I am trying to perform a dot product between a vector and a matrix, starting from an RDD. For reference, the RDD contains TF-IDF values built with Spark ML, which produces a dataframe of SparseVectors that I have mapped to DenseVectors in order to do the multiplication. The dataframe and the corresponding RDD are called tfidf_df and tfidf_rdd respectively.
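
For clarity, the mapping I refer to is just densifying each row; with toy values (not my actual features) it looks like this:

```python
from pyspark.ml.linalg import SparseVector
from pyspark.mllib.linalg import DenseVector

# Toy values only: densify the kind of sparse row that Spark ML's IDF stage
# emits, so it can be used directly in NumPy dot products.
sv = SparseVector(4, {0: 1.5, 3: 0.7})   # shape of one tf_idf_features entry
dv = DenseVector(sv.toArray())           # dense copy used for the multiplication
print(dv)                                # [1.5,0.0,0.0,0.7]
```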

What I do, which works, is the following (full script with sample data):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.ml.feature import IDF, Tokenizer, CountVectorizer
from pyspark.mllib.linalg import DenseVector
import numpy as np

sc = SparkContext()
sqlc = SQLContext(sc)
spark_session = SparkSession(sc)

sentenceData = spark_session.createDataFrame([
    (0, "I go to school school is good"),
    (1, "I like school"),
    (2, "I also like cinema")
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="tokens")
tokens_df = tokenizer.transform(sentenceData)

# TF feats
count_vectorizer = CountVectorizer(inputCol="tokens", outputCol="tf_features")
model = count_vectorizer.fit(tokens_df)
tf_df = model.transform(tokens_df)

print(model.vocabulary)
print(tf_df.rdd.take(5))

idf = IDF(inputCol="tf_features", outputCol="tf_idf_features")
model = idf.fit(tf_df)
tfidf_df = model.transform(tf_df)

# Transform into RDD of dense vectors
tfidf_rdd = tfidf_df.select("tf_idf_features") \
    .rdd \
    .map(lambda row: DenseVector(row.tf_idf_features.toArray()))

print(tfidf_rdd.take(3))

# Select the test vector
test_label = 1
vec = tfidf_df.filter(tfidf_df.label == test_label) \
    .select('tf_idf_features') \
    .rdd \
    .map(lambda row: DenseVector(row.tf_idf_features.toArray())) \
    .collect()[0]

rddB = tfidf_rdd.map(lambda row: np.dot(row / np.linalg.norm(row), vec / np.linalg.norm(vec))) \
    .zipWithIndex()

# print('*** multiplication', rddB.take(20))

# Sort the similarities
sorted_rddB = rddB.sortByKey(False)

print(sorted_rddB.take(20))
```

The test vector has been selected as the one with label 1. The final result with the similarities is the output of the last print statement, where the indexes have been used to trace back to the original dataset.

This works fine, but it looks a bit clumsy. I am looking for the best practice for performing a multiplication between a selected row of a dataframe (a vector) and all of the dataframe's vectors. I am open to any advice on the workflow, especially performance-related suggestions.
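
One alternative I have been considering is sketched below, assuming the same sc, tfidf_df and test_label from the script above: L2-normalize the TF-IDF column once with Normalizer and broadcast the normalized test vector, so that the cosine similarity reduces to a plain dot product. I am not sure whether this is actually better than capturing vec in the closure as I do above, which is part of what I am asking.

```python
from pyspark.ml.feature import Normalizer

# Sketch only: assumes sc, tfidf_df and test_label from the script above.
# L2-normalize the TF-IDF vectors once, so cosine similarity becomes a dot product.
normalizer = Normalizer(inputCol="tf_idf_features", outputCol="norm_features", p=2.0)
norm_df = normalizer.transform(tfidf_df)

# Broadcast the (already unit-length) test vector so each executor gets one copy.
test_vec = norm_df.filter(norm_df.label == test_label) \
    .select("norm_features") \
    .rdd \
    .map(lambda row: row.norm_features.toArray()) \
    .collect()[0]
bc_vec = sc.broadcast(test_vec)

# Dot product of unit vectors == cosine similarity; keep the label for traceability.
similarities = norm_df.select("label", "norm_features") \
    .rdd \
    .map(lambda row: (float(row.norm_features.toArray().dot(bc_vec.value)), row.label))

print(similarities.sortByKey(ascending=False).take(20))
```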

0 Answers:

No answers yet.