I am trying to run a batch job that computes recommendations for every user; right now I am running it against the MovieLens dataset. When I try to fetch the Rating[] from recommendProducts for each user inside an RDD operation, it throws:

org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations

I understand this happens because one RDD cannot be accessed from inside an operation on another RDD (the MatrixFactorizationModel holds its factor matrices as RDDs, so recommendProducts cannot run on an executor), but what is an alternative way to do the same thing?
final MatrixFactorizationModel model = MatrixFactorizationModel.load(sc, "S3 PATH FOR MODEL");
JavaRDD<Integer> userIdRDD = data.map(
    new Function<String, Integer>() {
        public Integer call(String s) {
            String[] sarray = s.split(",");
            return Integer.parseInt(sarray[0]);
        }
    }
);
userIdRDD.distinct().foreach(
    new VoidFunction<Integer>() {
        public void call(Integer id) throws Exception {
            System.out.println("User Id: " + id);
            Rating[] recommendProducts = model.recommendProducts(id, 10);
            List<Recommendations> userRecommendations = new ArrayList<Recommendations>();
            for (int i = 0; i < recommendProducts.length; i++) {
                userRecommendations.add(new Recommendations(i + 1, id, recommendProducts[i].product()));
            }
            RedshiftUtility.batchInsert(jdbcURL, userRecommendations);
        }
    }
);
If I convert userIdRDD to a List with collect, I can iterate over that list and save the recommendations, but then I assume the processing happens on the driver rather than on the cluster. I want both the recommendation computation and the database inserts to run in a parallel environment, not on the driver.
Edit
I have edited the code so that it works on a single RDD. How can I test whether it actually runs in parallel on the Spark cluster?
JavaSparkContext jsc = SparkContextFactory.getSparkContext(accessKey, secretKey);
SparkContext sc = jsc.sc();
final MatrixFactorizationModel model = MatrixFactorizationModel.load(sc, "s3n://redshift-temp-copy/model");
JavaRDD<Tuple2<Object, Rating[]>> userRecommendationsRDD = model.recommendProductsForUsers(100).toJavaRDD();
userRecommendationsRDD.foreach(
    new VoidFunction<Tuple2<Object, Rating[]>>() {
        public void call(Tuple2<Object, Rating[]> objectTuple2) throws Exception {
            List<Recommendations> userRecommendations = new ArrayList<Recommendations>();
            for (int i = 0; i < objectTuple2._2().length; i++) {
                System.out.println("Object Tuple: " + i + " > " + objectTuple2._1().toString() + " > " + objectTuple2._2()[i].product());
                userRecommendations.add(new Recommendations(i + 1, Integer.parseInt(objectTuple2._1().toString()), objectTuple2._2()[i].product()));
            }
            RedshiftUtility.batchInsert(jdbcURL, userRecommendations);
        }
    }
);
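On verifying parallelism and on the per-record batchInsert: one way to confirm the work is spread across executors is to log org.apache.spark.TaskContext.getPartitionId() inside the closure and check the executor logs in the Spark UI; different partition IDs on different executors mean the job is distributed. Also, calling batchInsert once per user opens a connection per record; a common refinement is foreachPartition, where you reuse one connection per partition and flush rows in fixed-size chunks. The chunking itself is plain Java. Below is a minimal, Spark-free sketch of that chunking helper (class and method names are my own, Java 8 assumed; the chunk size is arbitrary):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class ChunkedFlush {
    // Buffers items from an iterator and hands them to `flush` in chunks of
    // at most `chunkSize`, so a JDBC batch insert never grows unbounded.
    // Returns the total number of items flushed.
    public static <T> int flushInChunks(Iterator<T> rows, int chunkSize, Consumer<List<T>> flush) {
        List<T> buffer = new ArrayList<T>(chunkSize);
        int flushed = 0;
        while (rows.hasNext()) {
            buffer.add(rows.next());
            if (buffer.size() == chunkSize) {
                flush.accept(buffer);
                flushed += buffer.size();
                buffer = new ArrayList<T>(chunkSize);
            }
        }
        if (!buffer.isEmpty()) { // trailing partial chunk
            flush.accept(buffer);
            flushed += buffer.size();
        }
        return flushed;
    }
}
```

Inside a foreachPartition closure you would then call something like flushInChunks(partitionIterator, 500, batch -> RedshiftUtility.batchInsert(jdbcURL, toRecommendations(batch))), where toRecommendations is a hypothetical helper mapping the tuples to your Recommendations objects.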