I have a PySpark RDD containing sentence IDs and vectors:
[Row(review_id='R1EGUNQON2J277_id_What', vector=DenseVector([0.033, 0.1455, -0.1428])),
Row(review_id='R1FIU59Z4UXOIT_id_I wanted the gift car4 for MUSIC not movies', vector=DenseVector([-0.0121, 0.1022, -0.0883])),
Row(review_id='R359VY6VMS5CKK_id_I had no idea that the amount is listed in US dollars', vector=DenseVector([0.0597, 0.0795, 0.1087])),
Row(review_id='R359VY6VMS5CKK_id_I ended purchasing US $100 instead of CAD $100, which meant extra $30CAD', vector=DenseVector([0.1173, 0.1267, -0.0521]))]
I want to compute the center vector of this list and the 10 vectors closest to it.
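To make the goal concrete, here is a small in-memory NumPy sketch of what I mean (the IDs and values are purely illustrative): the "center" is the vector with the smallest average cosine distance to all the others, and I then want its 10 nearest neighbours by cosine distance.

import numpy as np

# Illustrative in-memory version of what I want to do on the full RDD
ids = ["R1", "R2", "R3"]
vectors = np.array([[0.033, 0.1455, -0.1428],
                    [-0.0121, 0.1022, -0.0883],
                    [0.0597, 0.0795, 0.1087]])

# Pairwise cosine distance matrix: 1 - (a . b) / (|a| * |b|)
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
cos_dist = 1.0 - unit @ unit.T

# "Center" = the vector with the smallest mean cosine distance to the others
center_idx = cos_dist.mean(axis=1).argmin()

# Its 10 nearest neighbours (here there are only 3 points, so all of them)
nearest = np.argsort(cos_dist[center_idx])[:10]
print(ids[center_idx], [ids[i] for i in nearest])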
I use a UDF for the cosine difference together with out-of-the-box functions such as agg and min, as shown below:
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.functions import avg, rank
from pyspark.sql.window import Window

# Function for cosine similarity
def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Map function to get the cosine difference for each pair of rows
def diff(row):
    return row[0][0], row[1][0], 1 - cos_sim(row[0][1], row[1][1])

# Function to get the center of the vectors
def get_center(pca_embeddings):
    # Get the Cartesian product and calculate the cosine difference for each pair
    cartesian_product = pca_embeddings.cartesian(pca_embeddings)
    cosine_rdd = cartesian_product.map(diff)
    new_df = cosine_rdd.toDF(["ReviewA", "ReviewB", "CosineDiff"])

    # Get the average cosine difference for each Review A
    w = Window.partitionBy(new_df["ReviewA"])
    mean_df = new_df.withColumn("mean", avg("CosineDiff").over(w))

    # Rank the Review Bs for each Review A by cosine difference
    window = Window.partitionBy(mean_df["ReviewA"]).orderBy(mean_df["CosineDiff"])
    rank_df = mean_df.select("*", rank().over(window).alias("rank"))

    # Collect the top 11 Review Bs for each Review A, including itself
    final_df = (rank_df.filter(rank_df["rank"] < 12)
                .groupBy("ReviewA")
                .agg(F.collect_list("ReviewB").alias("Nearest Neighbours"),
                     F.max("mean").alias("Mean Cosine")))

    # Get the Review A with the smallest average cosine difference
    final_row = final_df.select(
        F.min(F.struct("Mean Cosine",
                       *(x for x in final_df.columns if x != "Mean Cosine"))).alias("center")
    ).first()
    return final_row.asDict()["center"].asDict()
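For reference, I call it roughly like this (assuming pca_embeddings is the RDD of Row(review_id=..., vector=...) objects shown above):

# pca_embeddings is the RDD of Row(review_id=..., vector=...) shown above
center = get_center(pca_embeddings)
print(center)
# -> a dict with keys 'Mean Cosine', 'ReviewA' and 'Nearest Neighbours'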
This takes forever to run on a list of 120,000 vectors.
How can I optimize the code so that it actually produces output?
Also, could you give me some suggestions for running this faster on EMR?
Thanks.