From the ALS (Alternating Least Squares) algorithm

Time: 2018-05-12 17:34:42

Tags: python apache-spark pyspark google-cloud-platform

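We use the ALS (Alternating Least Squares) method in a Spark environment on Google Cloud to recommend companies to our users. To generate recommendations, we use tuples of the form (userId, companyId, rating), where the rating combines several signals of user interest, such as clicking a company's page, adding a company to a favorites list, or ordering from a company. (Our approach is very similar to the one in this link: https://cloud.google.com/solutions/recommendations-using-machine-learning-on-compute-engine#Training-the-models)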
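A minimal sketch of how such a combined rating could be assembled from raw interaction events. The event names and weights below are hypothetical, chosen only for illustration, and are not taken from the setup described above:

```python
from collections import defaultdict

# Hypothetical weights (an assumption): stronger signals of interest count for more.
EVENT_WEIGHTS = {"page_click": 1.0, "favorite": 3.0, "order": 5.0}


def build_ratings(events):
    """Aggregate (userId, companyId, eventType) events into rating tuples."""
    totals = defaultdict(float)
    for user_id, company_id, event_type in events:
        totals[(user_id, company_id)] += EVENT_WEIGHTS[event_type]
    # Emit (userId, companyId, rating) tuples, sorted for stable output.
    return sorted((u, c, r) for (u, c), r in totals.items())


# Toy events, invented for this example.
events = [
    (471, 10, "page_click"),
    (471, 10, "favorite"),
    (490, 22, "order"),
]
ratings = build_ratings(events)
print(ratings)
```

Summing weights is only one possible choice; capping or normalizing the total per user would be an equally reasonable variant.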

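The results are quite good and fit our business case, but we are missing one thing that matters to us. We need to know which users are grouped together as having similar interests (a.k.a. neighbors). Is there a way to get the grouped users out of PySpark's ALS algorithm, so that we can tag users based on that grouping?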

Edit

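I tried the code from the answer below, but the results look odd. My data consists of (userId, companyId, rating) tuples. When I run the following code, it puts users who have no companyId in common into the same clusterId. For example, one of the results of the code below is: (userId: 471, clusterId: 2) (userId: 490, clusterId: 2)

However, users 471 and 490 have nothing in common. I think there is a bug here.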

from __future__ import print_function

import sys
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import IntegerType
from pyspark.mllib.clustering import KMeans, KMeansModel

conf = SparkConf().setAppName("user_clustering")
sc = SparkContext(conf=conf)
sc.setCheckpointDir('checkpoint/')
sqlContext = SQLContext(sc)

CLOUDSQL_INSTANCE_IP = sys.argv[1]
CLOUDSQL_DB_NAME = sys.argv[2]
CLOUDSQL_USER = sys.argv[3]
CLOUDSQL_PWD  = sys.argv[4]

BEST_RANK = int(sys.argv[5])
BEST_ITERATION = int(sys.argv[6])
BEST_REGULATION = float(sys.argv[7])

TABLE_ITEMS  = "companies"
TABLE_RATINGS = "ml_ratings"
TABLE_RECOMMENDATIONS = "ml_reco"
TABLE_USER_CLUSTERS = "ml_user_clusters"

# Read the data from the Cloud SQL
# Create dataframes
#[START read_from_sql]
jdbcUrl    = 'jdbc:mysql://%s:3306/%s?user=%s&password=%s' % (CLOUDSQL_INSTANCE_IP, CLOUDSQL_DB_NAME, CLOUDSQL_USER, CLOUDSQL_PWD)
dfAccos = sqlContext.read.jdbc(url=jdbcUrl, table=TABLE_ITEMS)
dfRates = sqlContext.read.jdbc(url=jdbcUrl, table=TABLE_RATINGS)
print("Start Clustering Users")


# print("User Ratings:")
# dfRates.show(100)
#[END read_from_sql]

# Get all the ratings rows of our user

# print("Filtered User Ratings For User:",USER_ID)
# print("------------------------------")
# for x in dfUserRatings:
#      print(x)

#[START split_sets]
rddTraining, rddValidating, rddTesting = dfRates.rdd.randomSplit([6,2,2])
print("RDDTraining Size:",rddTraining.count()," RDDValidating Size:",rddValidating.count()," RDDTesting Size:",rddTesting.count())
print("Rank:",BEST_RANK," Iteration:",BEST_ITERATION," Regulation:",BEST_REGULATION)

#print("RDD Training Values:",rddTraining.collect())

#[END split_sets]

print("Start predicting")
#[START predict]
# Build our model with the best found values
# Rating, Rank, Iteration, Regulation
model = ALS.train(rddTraining, BEST_RANK, BEST_ITERATION, BEST_REGULATION)


# print("-----------------")
# print("User Groups Are Created")
# print("-----------------")

user_features = model.userFeatures().map(lambda x: x[1])
related_users = model.userFeatures().map(lambda x: x[0])
number_of_clusters = 10
model_kmm = KMeans.train(user_features, number_of_clusters, initializationMode = "random", runs = 3)
user_features_with_cluster_id = model_kmm.predict(user_features)
user_features_with_related_users = related_users.zip(user_features_with_cluster_id)
clusteredUsers = user_features_with_related_users.map(lambda x: (x[0],x[1]))
orderedUsers = clusteredUsers.takeOrdered(200,key = lambda x: x[1])

print("Ordered Users:")
print("--------------")
for x in orderedUsers:
    print(x)


#[START save user groups]
userGroupSchema = StructType([StructField("primaryUser", IntegerType(), True), StructField("groupId", IntegerType(), True)])
dfUserGroups = sqlContext.createDataFrame(orderedUsers,userGroupSchema)

try:
    dfUserGroups.write.jdbc(url=jdbcUrl, table=TABLE_USER_CLUSTERS, mode='append')
except:
    print("Data is already written to DB")


print("Written to DB and Finished Job")

Thanks

1 answer:

Answer 0 (score: 0)

After training the model, you can get the user feature vectors with userFeatures().

After that, you can compute the distance between users with some distance function, or use a clustering model such as KMeans.
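As a rough sketch of the distance-function route, the snippet below ranks a user's nearest neighbors by cosine similarity. It runs on plain Python lists; the toy vectors are invented and only mimic the (userId, featureVector) pairs that userFeatures() yields once collected. Cosine similarity is one possible choice here, not something the answer prescribes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def nearest_neighbors(features, target_id, top_n=2):
    """Return the top_n (userId, similarity) pairs closest to target_id."""
    vectors = dict(features)
    target = vectors[target_id]
    scored = [
        (other_id, cosine_similarity(target, vec))
        for other_id, vec in vectors.items()
        if other_id != target_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy data shaped like collected (userId, featureVector) pairs.
user_features = [
    (471, [0.9, 0.1]),
    (490, [0.1, 0.8]),
    (512, [0.8, 0.2]),
]
neighbors = nearest_neighbors(user_features, 471, top_n=1)
print(neighbors)
```

Collecting all vectors to the driver only suits a modest number of users; for larger sets the same comparison would have to be distributed.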

So, if the model has already been trained:
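A minimal sketch of the grouping step, written in plain Python so it runs without a cluster. It assumes the (userId, featureVector) pairs have been obtained from userFeatures() and that cluster centers are already available from a fitted KMeans model; the toy values below are invented. Each user is assigned to the nearest center by squared Euclidean distance, and the ID stays attached to its own vector throughout.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_clusters(features, centers):
    """Return (userId, clusterId) pairs, keeping each ID with its own vector."""
    assignments = []
    for user_id, vector in features:
        # Pick the index of the closest center for this user's vector.
        cluster_id = min(
            range(len(centers)),
            key=lambda i: squared_distance(vector, centers[i]),
        )
        assignments.append((user_id, cluster_id))
    return assignments


# Hypothetical stand-ins (assumptions): in a Spark job the pairs would come
# from the trained ALS model and the centers from the fitted KMeans model.
user_features = [(471, [0.9, 0.1]), (490, [0.1, 0.8]), (512, [0.8, 0.2])]
centers = [[1.0, 0.0], [0.0, 1.0]]
clusters = assign_clusters(user_features, centers)
print(clusters)
```

Users that land in the same cluster are close in the latent feature space, which does not by itself guarantee that they rated the same companies.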
