Spark ML - unable to use MatrixFactorizationModel

Time: 2016-08-17 21:43:37

Tags: java apache-spark apache-spark-mllib

I am trying to implement a recommendation system using Spark collaborative filtering.

First I train the model and save it to disk:

MatrixFactorizationModel model = trainModel(inputDataRdd);  
model.save(jsc.sc(), "/op/tc/model/");
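For context, here is a minimal sketch of what a trainModel helper might look like with MLlib's ALS; the rank, iteration count, and regularization value are illustrative assumptions, not values taken from the question:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.recommendation.ALS;
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
    import org.apache.spark.mllib.recommendation.Rating;

    // Hypothetical training step: inputDataRdd is assumed to be a JavaRDD<Rating>
    // of (user, product, rating) triples; the parameter values are examples only.
    static MatrixFactorizationModel trainModel(JavaRDD<Rating> inputDataRdd) {
        int rank = 10;          // number of latent factors
        int numIterations = 10; // ALS iterations
        double lambda = 0.01;   // regularization parameter
        return ALS.train(inputDataRdd.rdd(), rank, numIterations, lambda);
    }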

When I load the model from a separate process, the program fails with the exception below.
Code:

    static JavaSparkContext jsc;
    private static Options options;
    static {
        SparkConf conf = new SparkConf().setAppName("TC recommender application");
        conf.set("spark.driver.allowMultipleContexts", "true");
        jsc = new JavaSparkContext(conf);
    }

    MatrixFactorizationModel model = MatrixFactorizationModel.load(jsc.sc(),
            "/op/tc/model/");

Exception:

  

Exception in thread "main" java.io.IOException: Not a file: maprfs:/op/tc/model/data
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:324)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
    at org.apache.spark.rdd.RDD$$anonfun$aggregate$1.apply(RDD.scala:1114)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.aggregate(RDD.scala:1107)
    at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.countApproxDistinctUserProduct(MatrixFactorizationModel.scala:96)
    at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:126)
    at com.aexp.cxp.recommendation.ProductRecommendationIndividual.main(ProductRecommendationIndividual.java:62)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:742)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Is there any configuration I need to set in order to load the model? Any suggestion would be a great help.

1 Answer:

Answer 0 (score: 1)

In Spark, as in any other distributed computing framework, it is really important to know where your code runs when you are trying to debug it. It is also important to have access to the various types of logs. On YARN, for instance, you would have:

  • the master logs, if you record them yourself
  • the aggregated slave logs (thanks YARN, useful feature!)
  • the YARN node manager logs (which will, for instance, tell you why a container was killed, etc.)

Digging into Spark issues can be very time consuming if you don't look in the right places from the start. Now, more specifically about this question: you have a clear stack trace, which is not always the case, so you should use that to your advantage.

The top of the stack trace is:

  

Exception in thread "main" java.io.IOException: Not a file: maprfs:/op/tc/model/data
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:324)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at ...

As you can see, the Spark job was performing a map operation when it failed. Who executes a map? The slaves. So you have to make sure the file is available on all the slaves, not only on the master.
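One way to act on that advice, sketched below under the assumption that the cluster's shared filesystem is MapR-FS (adapt the URI scheme, e.g. hdfs:// or s3a://, to your cluster), is to save and load the model through a fully qualified URI that every executor resolves to the same distributed location:

    // Sketch: write the model to storage every executor can read, using the
    // fully qualified filesystem URI rather than a bare local-looking path.
    // The path below is derived from the question; the scheme is an assumption.
    String modelPath = "maprfs:///op/tc/model/";
    model.save(jsc.sc(), modelPath);

    MatrixFactorizationModel loaded =
            MatrixFactorizationModel.load(jsc.sc(), modelPath);

The key point is that the save and load paths must refer to storage visible to the slaves, not to a directory that exists only on the driver machine.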

More generally, you always need to make a clear distinction between the code you write for the master and the code you write for the slaves. That will help you detect this kind of interaction, as well as references to non-serializable objects and other such common mistakes.