I'm new to Spark, and I'm running the implicit collaborative filtering example from MLlib from here. When I run the code below on my data, I get the following error:
ValueError: RDD is empty
Here is my data:
101,1000010,1
101,1000011,1
101,1000015,1
101,1000017,1
101,1000019,1
102,1000010,1
102,1000012,1
102,1000019,1
103,1000011,1
103,1000012,1
103,1000013,1
103,1000014,1
103,1000017,1
104,1000010,1
104,1000012,1
104,1000013,1
104,1000014,1
104,1000015,1
104,1000016,1
104,1000017,1
105,1000017,1
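Each row is user, product, rating, in that order, which is how the code below reads it. Just as an illustration (this snippet is not part of my program), the first row splits like this:

line = "101,1000010,1"
fields = line.split(',')   # ['101', '1000010', '1'] -- each field is still a string until it is cast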
My code:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
data = sc.textFile("s3://xxxxxxxxxxxx.csv")
ratings = data.map(lambda l: l.split(','))\
.map(lambda l: Rating(l[0], l[1], float(l[2])))
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
alpha = 0.01
model = ALS.trainImplicit(ratings, rank, numIterations, alpha)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
# Convert the joined RDD to a DataFrame and display it
ratesAndPreds.toDF().show()
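For debugging, a rough sketch like the following (reusing the variables defined above, not part of the original program) would show which stage ends up empty before toDF() is called:

# Diagnostic sketch: count each intermediate RDD to see which stage is empty.
print("ratings: %d" % ratings.count())
print("testdata: %d" % testdata.count())
print("predictions: %d" % predictions.count())
print("ratesAndPreds: %d" % ratesAndPreds.count())
# If ratesAndPreds is 0 while the earlier counts are non-zero, the join matched nothing --
# worth checking that the (user, product) keys on both sides of the join have the same type.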