Processing multiple files in Spark separately

Date: 2016-03-21 14:28:07

Tags: java apache-spark

I need help implementing a workflow with Apache Spark. My task is as follows:

  1. I have several CSV files as source data. Note: these files may have different layouts.

  2. I have metadata describing how each file needs to be parsed (this is not a problem).

  3. Main goal: the result is each source file extended with several additional columns. I must update each source file without joining them into one output. For example: 10 source files -> 10 result files, and each result file contains only data from its corresponding source file.

  4. As far as I know, Spark can open many files via a mask:

     var source = sc.textFile("/source/data*.gz");
    

    But in this case I cannot tell which file any given line came from. If I get the list of source files and try to process them with the following scheme:

    JavaSparkContext sc = new JavaSparkContext(...);
    List<String> files = new ArrayList<>(); // list of the source files' full names
    for (String f : files)
    {
       JavaRDD<String> data = sc.textFile(f);
       // process this file with Spark, producing outRdd
       outRdd.coalesce(1, true).saveAsTextFile(f + "_out");
    }
    

    But in this case I would process all the files sequentially.

    My question is: how can I process multiple files in parallel? For example: one file per executor?

    I tried to implement this with straightforward code over my source data:

    //JSON file with paths to 4 source files, saved in inData variable
    {
    "files": [
        {
            "name": "/mnt/files/DigilantDaily_1.gz",
            "layout": "layout_1"
        },
        {
            "name": "/mnt/files/DigilantDaily_2.gz",
            "layout": "layout_2"
        },
        {
            "name": "/mnt/files/DigilantDaily_3.gz",
            "layout": "layout_3"
        },
        {
            "name": "/mnt/files/DigilantDaily_4.gz",
            "layout": "layout_4"
        }
      ]
     }
    
        List<SourceFile> sourceFiles = new ArrayList<>();
        JSONObject jsFiles = (JSONObject) new JSONParser().parse(new FileReader(new File(inData)));
        Iterator<JSONObject> iterator = ((JSONArray)jsFiles.get("files")).iterator();
        while (iterator.hasNext()){
            SourceFile sf = new SourceFile();
            JSONObject js = iterator.next();
            sf.FilePath = (String) js.get("name");
            sf.MetaPath = (String) js.get("layout");
            sourceFiles.add(sf);
        }
    
        SparkConf sparkConf = new SparkConf()
                .setMaster("local[*]")
                .setAppName("spark-app");
        final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
    
        try {
    
            final Validator validator = new Validator();
    
            ExecutorService pool = Executors.newFixedThreadPool(4);
    
            for(final SourceFile f : sourceFiles)
            {
                pool.execute(new Runnable() {
    
                    @Override
                    public void run() {
    
                        final Path inFile = Paths.get(f.FilePath);
    
                        JavaRDD<String> d1 = sparkContext
                                .textFile(f.FilePath)
                                .filter(new Function<String, Boolean>() {
                                    @Override
                                    public Boolean call(String s) throws Exception {
                                        return validator.parseRow(s);
                                    }
                                });
    
                        JavaPairRDD<String, Integer> d2 = d1.mapToPair(new PairFunction<String, String, Integer>() {
                            @Override
                            public Tuple2<String, Integer> call(String s) throws Exception {
                                String userAgent = validator.getUserAgent(s);
                                return new Tuple2<>(DeviceType.deviceType(userAgent), 1);
                            }
                        });
    
                        JavaPairRDD<String, Integer> d3 = d2.reduceByKey(new Function2<Integer, Integer, Integer>() {
                            @Override
                            public Integer call(Integer val1, Integer val2) throws Exception {
                                return val1 + val2;
                            }
                        });
    
                        d3.coalesce(1, true)
                                .saveAsTextFile(outFolder + "/" + inFile.getFileName().toString());//, org.apache.hadoop.io.compress.GzipCodec.class);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(60, TimeUnit.MINUTES);
        } catch (Exception e) {
            throw e;
        } finally {
            if (sparkContext != null) {
                sparkContext.stop();
            }
        }
    

    But this code fails with an exception:

    Exception in thread "pool-13-thread-2" Exception in thread "pool-13-thread-3" Exception in thread "pool-13-thread-1" Exception in thread "pool-13-thread-4" java.lang.Error: org.apache.spark.SparkException: Task not serializable
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
    at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
    at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
    at org.apache.spark.rdd.RDD.filter(RDD.scala:334)
    at org.apache.spark.api.java.JavaRDD.filter(JavaRDD.scala:78)
    at append.dev.App$1.run(App.java:87)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    ... 2 more
    

    I would like to know: where is my mistake?

    Thanks for your help!

2 answers:

Answer 0 (score: 0)

You can use sc.wholeTextFiles(dirname) to get an RDD of (filename, content) pairs and then map over that.
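
A minimal sketch of that idea in the same Java API as the question (the mask, the processing step, and the variable names are illustrative, not from the original post; it assumes the same imports and sparkContext as the question's code):

    // wholeTextFiles returns (filePath, fileContent) pairs, so every record
    // keeps track of which source file it came from.
    JavaPairRDD<String, String> files =
            sparkContext.wholeTextFiles("/mnt/files/DigilantDaily_*.gz");

    JavaPairRDD<String, String> processed = files.mapToPair(
            new PairFunction<Tuple2<String, String>, String, String>() {
                @Override
                public Tuple2<String, String> call(Tuple2<String, String> file) throws Exception {
                    String path = file._1();     // full path of the source file
                    String content = file._2();  // entire file content as one string
                    // layout-specific parsing/enrichment would go here
                    return new Tuple2<>(path, content);
                }
            });

Note that wholeTextFiles reads each file as a single string, so it suits files that fit comfortably in executor memory; the path key can then be used to write each result out to its own location.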

Answer 1 (score: 0)

I have used a similar multithreaded approach and it works well. I believe the problem lies in the inner classes you define.

Create the runnable/callable as a separate class and make sure it is shipped to Spark together with the jar you submit. Also implement Serializable, since you are implicitly passing state into your functions (f.FilePath).
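
A rough sketch of that suggestion (the class name is illustrative; it also assumes Validator itself is made Serializable, since it is captured by the function):

    import java.io.Serializable;
    import org.apache.spark.api.java.function.Function;

    // A top-level (or static nested) function class: Spark only needs to
    // serialize this small object, not the enclosing anonymous Runnable and
    // the JavaSparkContext it references.
    public class RowFilter implements Function<String, Boolean>, Serializable {
        private final Validator validator;

        public RowFilter(Validator validator) {
            this.validator = validator;
        }

        @Override
        public Boolean call(String row) throws Exception {
            return validator.parseRow(row);
        }
    }

The filter step in the question would then become .filter(new RowFilter(validator)), and the same pattern applies to the PairFunction and Function2 steps.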