I have an RDD of DenseVector objects and I want to:

1. select one of those vectors, and
2. multiply that vector with every other vector in the RDD.
Basically, I am trying to perform a dot product between a vector and a matrix, starting from an RDD. For reference, the RDD holds TF-IDF values built with Spark ML, which produces a DataFrame of SparseVectors that I have mapped to DenseVectors in order to do the multiplication. The DataFrame and the corresponding RDD are called tfidf_df and tfidf_rdd, respectively.

What I do, and it works, is the following (full script with sample data):
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.ml.feature import IDF, Tokenizer, CountVectorizer
from pyspark.mllib.linalg import DenseVector
import numpy as np
sc = SparkContext()
sqlc = SQLContext(sc)
spark_session = SparkSession(sc)
sentenceData = spark_session.createDataFrame([
    (0, "I go to school school is good"),
    (1, "I like school"),
    (2, "I also like cinema")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="tokens")
tokens_df = tokenizer.transform(sentenceData)
# TF feats
count_vectorizer = CountVectorizer(inputCol="tokens",
                                   outputCol="tf_features")
model = count_vectorizer.fit(tokens_df)
tf_df = model.transform(tokens_df)
print(model.vocabulary)
print(tf_df.rdd.take(5))
idf = IDF(inputCol="tf_features",
          outputCol="tf_idf_features")
model = idf.fit(tf_df)
tfidf_df = model.transform(tf_df)
# Transform into RDD of dense vectors
tfidf_rdd = tfidf_df.select("tf_idf_features") \
    .rdd \
    .map(lambda row: DenseVector(row.tf_idf_features.toArray()))
print(tfidf_rdd.take(3))
# Select the test vector
test_label = 1
vec = tfidf_df.filter(tfidf_df.label == test_label) \
    .select('tf_idf_features') \
    .rdd \
    .map(lambda row: DenseVector(row.tf_idf_features.toArray())).collect()[0]
# Cosine similarity of every vector with the test vector;
# zipWithIndex attaches each row's position so results can be traced back
rddB = tfidf_rdd.map(lambda row: np.dot(row / np.linalg.norm(row),
                                        vec / np.linalg.norm(vec))) \
    .zipWithIndex()
# print('*** multiplication', rddB.take(20))
# Sort the similarities
sorted_rddB = rddB.sortByKey(ascending=False)
print(sorted_rddB.take(20))
The test vector has been selected as the one with label 1. The final result with the similarities is the output of the last print statement, where the index has been used to trace back to the original dataset.
This works fine, but it looks a bit clunky. I am looking for the best practice for performing a multiplication between a selected row of the DataFrame (a vector) and all of the DataFrame's vectors. I am open to any suggestions about the workflow, especially performance-related ones.
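For what it's worth, one direction I have been considering is to stay in the DataFrame and broadcast the selected vector, computing the cosine similarity in a UDF instead of round-tripping through an RDD. This is only an untested sketch of my own, not an established recipe: it reuses sc, vec and tfidf_df from the script above, assumes a Spark version where udf can be used as a decorator, and the names cosine_sim and scored_df are made up by me:

import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Normalize the selected vector once and ship it to the executors
test_arr = vec.toArray()
bc_vec = sc.broadcast(test_arr / np.linalg.norm(test_arr))

@F.udf(returnType=DoubleType())
def cosine_sim(v):
    # v is one row's tf_idf_features vector
    arr = v.toArray()
    return float(np.dot(arr / np.linalg.norm(arr), bc_vec.value))

scored_df = tfidf_df.withColumn("similarity", cosine_sim("tf_idf_features")) \
    .orderBy(F.desc("similarity"))
scored_df.select("label", "similarity").show()

My thinking is that broadcasting the normalized vector avoids shipping it inside every closure, but I do not know how the Python UDF overhead compares with the RDD version, so comments on that would be welcome too.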