RDD partition problem when running an ALS program on a Spark standalone cluster

Asked: 2019-02-07 11:07:44

Tags: apache-spark pyspark apache-spark-standalone

I am running my ALS program in pyspark on a two-node Spark cluster. If I leave the `checkpointInterval` ALS parameter disabled, it runs fine for 20 iterations. For more than 20 iterations, `checkpointInterval` needs to be enabled. I have also set a checkpoint directory, but it gives me the error below, and I have not been able to resolve it.

The same program runs fine for 25 iterations on a single machine.

My error is:

 Py4JJavaError: An error occurred while calling o2574.fit.
 : org.apache.spark.SparkException: Checkpoint RDD has a different number
of partitions from original RDD. Original RDD [ID: 3265, num of 
partitions:10]; Checkpoint RDD [ID: 3266, num of partitions: 0].

My code is:

 import time
 start = time.time()

 from pyspark.sql import SparkSession


 spark = (SparkSession.builder
          .master('spark://172.16.12.200:7077')
          .appName('new')
          .getOrCreate())

 ndf = spark.read.json("Musical_Instruments_5.json")
 pd=ndf.select(ndf['asin'],ndf['overall'],ndf['reviewerID'])


 spark.sparkContext.setCheckpointDir("/home/npproject/jupyter_files/checkpoints")

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import TrainValidationSplit,ParamGridBuilder
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

indexer = [StringIndexer(inputCol=column, outputCol=column+"_index") for column in list(set(pd.columns)-set(['overall'])) ]

pipeline = Pipeline(stages=indexer)
transformed = pipeline.fit(pd).transform(pd)
(training,test)=transformed.randomSplit([0.8, 0.2])
als = ALS(maxIter=25, regParam=0.09, rank=25,
          userCol="reviewerID_index", itemCol="asin_index",
          ratingCol="overall", checkpointInterval=5,
          coldStartStrategy="drop", nonnegative=True)
model=als.fit(training)
evaluator=RegressionEvaluator(metricName="rmse",
labelCol="overall",predictionCol="prediction")
predictions=model.transform(test)
rmse=evaluator.evaluate(predictions)
print("RMSE="+str(rmse))
print("Rank: ",model.rank)
print("MaxIter: ",model._java_obj.parent().getMaxIter())
print("RegParam: ",model._java_obj.parent().getRegParam())

user_recs=model.recommendForAllUsers(10).show(20)

end = time.time()
print("execution time",end-start)
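An error of the form "Checkpoint RDD has a different number of partitions from original RDD ... num of partitions: 0" typically appears when the checkpoint directory sits on the driver's local filesystem: the Spark documentation states that `SparkContext.setCheckpointDir` must point to an HDFS-compatible (shared) directory when running on a cluster, because executors write their partitions there and the driver must be able to read them back. With a plain local path like the one above, each node sees only its own disk, so the recovered checkpoint appears empty. A minimal pre-flight sanity check is sketched below; `checkpoint_dir_is_shared` is a hypothetical helper, not a Spark API:

```python
from urllib.parse import urlparse

def checkpoint_dir_is_shared(master: str, checkpoint_dir: str) -> bool:
    """Hypothetical helper: flag checkpoint directories that a multi-node
    cluster cannot share. Not part of pyspark."""
    # Single-machine masters ("local", "local[*]") can checkpoint to local disk.
    if master.startswith("local"):
        return True
    # On a real cluster the directory should live on shared storage such as
    # HDFS or S3; a bare path or file:// URI resolves to each node's own disk.
    scheme = urlparse(checkpoint_dir).scheme
    return scheme not in ("", "file")

# The combination used in the question: cluster master + local path -> not shared.
print(checkpoint_dir_is_shared("spark://172.16.12.200:7077",
                               "/home/npproject/jupyter_files/checkpoints"))
# An HDFS URI (namenode host/port are placeholders) would pass the check.
print(checkpoint_dir_is_shared("spark://172.16.12.200:7077",
                               "hdfs://namenode:8020/checkpoints"))
```

If this diagnosis applies, pointing the checkpoint directory at a path all nodes can reach, e.g. `spark.sparkContext.setCheckpointDir("hdfs://<namenode>:8020/checkpoints")`, should let the checkpointed iterations complete.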

0 Answers:

There are no answers yet.