Hello programmers.
I recently started using pyspark, coming from a pandas background. I need to compute the similarity between users in my data. Since I couldn't find a built-in way to do this in pyspark, I resorted to building the similarities with a Python dictionary.
However, I've run out of ideas for converting that nested dictionary into a pyspark dataframe. Could you give me some guidance on achieving the desired result?
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from scipy.spatial import distance

spark = SparkSession.builder.getOrCreate()
traindf = spark.createDataFrame([
('u11',[1, 2, 3]),
('u12',[4, 5, 6]),
('u13',[7, 8, 9])
]).toDF("user","rating")
traindf.show()
Output:
+----+---------+
|user| rating|
+----+---------+
| u11|[1, 2, 3]|
| u12|[4, 5, 6]|
| u13|[7, 8, 9]|
+----+---------+
I want to compute the similarity between every pair of users and put the result into a pyspark dataframe.
parent_dict = {}
for parent_row in traindf.collect():
    child_dict = {}
    for child_row in traindf.collect():
        # cosine distance between the two users' rating vectors
        similarity = distance.cosine(parent_row['rating'], child_row['rating'])
        child_dict[child_row['user']] = similarity
    parent_dict[parent_row['user']] = child_dict
print(parent_dict)
Output:
{'u11': {'u11': 0.0, 'u12': 0.0253681538029239, 'u13': 0.0405880544333298},
'u12': {'u11': 0.0253681538029239, 'u12': 0.0, 'u13': 0.001809107314273195},
'u13': {'u11': 0.0405880544333298, 'u12': 0.001809107314273195, 'u13': 0.0}}
From this dictionary, I want to build a pyspark dataframe like this:
+-----+-----+--------------------+
|user1|user2| similarity|
+-----+-----+--------------------+
| u11| u11| 0.0|
| u11| u12| 0.0253681538029239|
| u11| u13| 0.0405880544333298|
| u12| u11| 0.0253681538029239|
| u12| u12| 0.0|
| u12| u13|0.001809107314273195|
| u13| u11| 0.0405880544333298|
| u13| u12|0.001809107314273195|
| u13| u13| 0.0|
+-----+-----+--------------------+
So far, I've tried converting the dict to a pandas dataframe and then converting that into a pyspark dataframe. However, I need to do this at scale, and I'm looking for a better, more scalable approach.
parent_user = []
child_user = []
child_similarity = []
for parent_row in traindf.collect():
    for child_row in traindf.collect():
        similarity = distance.cosine(parent_row['rating'], child_row['rating'])
        child_user.append(child_row['user'])
        child_similarity.append(similarity)
        parent_user.append(parent_row['user'])

my_dict = {}
my_dict['user1'] = parent_user
my_dict['user2'] = child_user
my_dict['similarity'] = child_similarity

import pandas as pd
df = spark.createDataFrame(pd.DataFrame(my_dict))
df.show()
Output:
+-----+-----+--------------------+
|user1|user2| similarity|
+-----+-----+--------------------+
| u11| u11| 0.0|
| u11| u12| 0.0253681538029239|
| u11| u13| 0.0405880544333298|
| u12| u11| 0.0253681538029239|
| u12| u12| 0.0|
| u12| u13|0.001809107314273195|
| u13| u11| 0.0405880544333298|
| u13| u12|0.001809107314273195|
| u13| u13| 0.0|
+-----+-----+--------------------+
Answer 0 (score: 1)
Maybe you can do the following:
import pandas as pd
from pyspark.sql import SQLContext

my_dic = {'u11': {'u11': 0.0, 'u12': 0.0253681538029239, 'u13': 0.0405880544333298},
          'u12': {'u11': 0.0253681538029239, 'u12': 0.0, 'u13': 0.001809107314273195},
          'u13': {'u11': 0.0405880544333298, 'u12': 0.001809107314273195, 'u13': 0.0}}

# unstack() pivots the nested dict into a Series indexed by (user1, user2)
df = pd.DataFrame.from_dict(my_dic).unstack().to_frame().reset_index()
df.columns = ['user1', 'user2', 'similarity']

sqlCtx = SQLContext(sc)  # sc is the SparkContext, e.g. spark.sparkContext
sqlCtx.createDataFrame(df).show()
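This works, but note that the whole similarity matrix is materialized in driver memory as a pandas dataframe before createDataFrame is called, so it sidesteps rather than solves the scale concern raised in the question.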
Answer 1 (score: 0)
OK, now your question is clearer. I assume you start with a Spark dataframe of user ratings. What you want to do is join this DF with itself, which creates a cross product containing all possible user pairs (and their ratings), including each user paired with itself (those rows can be filtered out later). Then compute a new column holding the similarity, as sketched below.
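A minimal sketch of that idea, assuming the traindf from the question and reusing scipy's cosine distance inside a UDF (the names cosine_udf, left, right, and similarities are illustrative):

import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType
from scipy.spatial import distance

# cosine distance between two rating lists, matching the question's metric
cosine_udf = F.udf(lambda a, b: float(distance.cosine(a, b)), DoubleType())

left = traindf.select(F.col("user").alias("user1"), F.col("rating").alias("rating1"))
right = traindf.select(F.col("user").alias("user2"), F.col("rating").alias("rating2"))

# crossJoin yields every (user1, user2) pair, including self-pairs
similarities = (left.crossJoin(right)
                .withColumn("similarity", cosine_udf("rating1", "rating2"))
                .drop("rating1", "rating2"))
similarities.show()

Nothing here calls collect(); the pairing and the similarity computation stay distributed, and only show() brings rows back to the driver.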
Answer 2 (score: 0)
import numpy as np
import pyspark.sql.functions as psf
from pyspark.sql.types import FloatType

def cos_sim(a, b):
    # cosine similarity of two rating vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dot_udf = psf.udf(lambda x, y: cos_sim(x, y), FloatType())

# data is the user/rating dataframe; the join condition drops self-pairs
data.alias("i").join(data.alias("j"), psf.col("i.user") != psf.col("j.user"))\
    .select(
        psf.col("i.user").alias("user1"),
        psf.col("j.user").alias("user2"),
        dot_udf("i.rating", "j.rating").alias("similarity"))\
    .sort("similarity")\
    .show()
Output, as desired:
+-----+-----+----------+
|user1|user2|similarity|
+-----+-----+----------+
| u11| u12|0.70710677|
| u13| u11|0.70710677|
| u11| u13|0.70710677|
| u12| u11|0.70710677|
| u12| u13| 1.0|
| u13| u12| 1.0|
+-----+-----+----------+
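Note that this answer computes cosine similarity rather than the cosine distance used in the question; scipy's distance.cosine(a, b) equals 1 minus this value, so the two scales are complementary rather than identical.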