Computing the cosine distance of a pandas DataFrame

Date: 2020-02-22 17:49:25

Tags: python pandas dataframe pyspark pyspark-sql

I have a pandas DataFrame (say df) of shape (70000 x 10). The head of the DataFrame is shown below:

                              0_x       1_x       2_x  ...       7_x       8_x       9_x
    userid                                             ...
    1000010249674395648  0.000007  0.999936  0.000007  ...  0.000007  0.000007  0.000007
    1000282310388932608  0.000060  0.816790  0.000060  ...  0.000060  0.000060  0.000060
    1000290654755450880  0.000050  0.000050  0.000050  ...  0.000050  0.191159  0.000050
    1000304603840241665  0.993157  0.006766  0.000010  ...  0.000010  0.000010  0.000010
    1000600081165438977  0.000064  0.970428  0.000064  ...  0.000064  0.000064  0.000064

I want to find the pairwise cosine distance between user IDs. For example:

cosine_distance(1000010249674395648, 1000282310388932608) = 0.9758776214797362

I used the following approaches mentioned here, but all of them threw a memory error while computing the cosine distance because of limited CPU memory:

  1. scikit-learn's cosine_similarity:

    from sklearn.metrics.pairwise import cosine_similarity
    cosine_sim = cosine_similarity(df)

  2. A faster vectorized solution found online:

    import numpy as np
    import pandas as pd

    def get_cosine_sim_df(df):
        topic_vectors = df.values
        norm_topic_vectors = topic_vectors / np.linalg.norm(topic_vectors, axis=-1)[:, np.newaxis]
        cosine_sim = np.dot(norm_topic_vectors, norm_topic_vectors.T)
        cosine_sim_df = pd.DataFrame(data=cosine_sim, index=df.index, columns=df.index)
        return cosine_sim_df

    cosine_sim = get_cosine_sim_df(df)

System hardware overview: (image in the original post)

I am looking for an efficient and fast way to compute the pairwise cosine distances within the CPU memory limit, e.g. with pyspark DataFrames or a pandas batching technique, rather than processing the whole DataFrame at once. (The dense 70000 x 70000 float64 similarity matrix alone takes about 39 GB, which is why the one-shot approaches above run out of memory.)
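
A minimal sketch of the kind of block-wise scheme I mean (the function name, output file, and block size below are just illustrative): normalize the rows once, then fill a disk-backed NumPy memmap one row-block at a time, so the full 70000 x 70000 matrix never has to sit in RAM:

    import numpy as np

    def get_cosine_sim_blockwise(df, out_path="cosine_sim.dat", block_size=5000):
        # Normalize the rows once; a 70000 x 10 float array easily fits in memory.
        X = df.values.astype(np.float32)
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        n = X.shape[0]
        # Disk-backed output: ~19.6 GB on disk for 70000 x 70000 float32,
        # but only one block_size x n slice is held in RAM at a time.
        sim = np.memmap(out_path, dtype=np.float32, mode="w+", shape=(n, n))
        for start in range(0, n, block_size):
            stop = min(start + block_size, n)
            sim[start:stop] = X[start:stop] @ X.T  # ~1.4 GB per 5000-row block
        sim.flush()
        return sim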

Any suggestions/approaches are appreciated.

FYI - I am using Python 3.7.

1 answer:

Answer 0: (score: 0)

I am using Spark 2.4 and Python 3.7.

# build spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
                    .master("local") \
                    .appName("cos_sim") \
                    .config("spark.some.config.option", "some-value") \
                    .getOrCreate()

Convert your pandas df to a Spark df:

# Pandas to Spark (the session was created above as `spark`)
df = spark.createDataFrame(pand_df)
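
A side note, not required for the rest of the answer: on Spark 2.4, if pyarrow is installed, Arrow-based conversion can make this pandas-to-Spark step considerably faster:

# Optional: use Arrow to speed up pandas <-> Spark conversion (needs pyarrow)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")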

I generated some random data:

import random
import pandas as pd
from pyspark.sql.functions import monotonically_increasing_id 

def generate_random_data(num_usrs = 20, num_cols = 10):
    cols = [str(i)+"_x" for i in range(num_cols)]
    usrsdata = [[random.random() for i in range(num_cols)] for _ in range(num_usrs)]
#     return pd.DataFrame(usrsdata, columns = cols)
    return spark.createDataFrame(data = usrsdata, schema = cols)

df = generate_random_data()
df = df.withColumn("uid", monotonically_increasing_id())
df.limit(5).toPandas()   # just for nice display of df (df not actually changed)

[image: spark_df_user_data]

Combine the feature columns of df into a single vector column:

from pyspark.ml.feature import VectorAssembler
# Assemble only the feature columns; the uid column must stay out of the vector
assembler = VectorAssembler(inputCols=[c for c in df.columns if c != "uid"], outputCol="features")
assembled = assembler.transform(df).select(['uid', 'features'])
assembled.limit(2).toPandas()

[image: uid_features_df]

Normalize the feature vectors (Normalizer uses the L2 norm by default, so the dot product of two normalized vectors is exactly their cosine similarity):

from pyspark.ml.feature import Normalizer
normalizer = Normalizer(inputCol="features", outputCol="norm")
data = normalizer.transform(assembled)
data.limit(2).toPandas()

[image: normalized_features]
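
As a quick, optional sanity check that Normalizer really produced unit-length vectors:

# The L2 norm of each normalized vector should be ~1.0
import numpy as np
np.linalg.norm(data.select("norm").first()["norm"].toArray())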

Compute the pairwise cosine similarities:

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
mat = IndexedRowMatrix(data.select("uid", "norm").rdd\
        .map(lambda row: IndexedRow(row.uid, row.norm.toArray()))).toBlockMatrix()
dot = mat.multiply(mat.transpose())
dot.toLocalMatrix().toArray()[:2]  # displaying first 2 users only
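
Since the question asks for cosine distance rather than similarity, subtract the similarities from 1. A sketch that stays distributed instead of collecting the full 70000 x 70000 matrix onto the driver:

# cosine distance = 1 - cosine similarity
sim_rows = dot.toIndexedRowMatrix().rows  # RDD of IndexedRow
dist_rows = sim_rows.map(lambda r: (r.index, 1.0 - r.vector.toArray()))
dist_rows.take(1)  # one user's distances to all users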


Reference: Calculating the cosine similarity between all the rows of a dataframe in pyspark