java: Reading .txt files from a Spark standalone cluster into a JavaRDD

Date: 2019-06-06 10:56:36

Tags: java apache-spark rdd apache-spark-dataset apache-spark-standalone

Scenario: write a dataset as text files to a specified location on the standalone cluster, then read those files back into a JavaRDD. Observed results:

Environment setup

  • Spark standalone cluster on Linux with a master and slaves (the master host plus one more server)
  • Java code launched with "spark-submit" from IntelliJ on a local Windows machine; the driver runs on the local machine (a minimal session sketch follows this list)
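
For context, a minimal sketch of the driver-side setup this implies. The master URL and port are assumptions (7077 is the standalone default), not values from the post:

import org.apache.spark.sql.SparkSession;

// Hypothetical bootstrap: a driver on the local Windows machine connecting
// to the remote standalone master. 7077 is the default standalone port.
SparkSession spark = SparkSession.builder()
        .appName("appName")                   // the app name seen in the master logs
        .master("spark://<master_ip>:7077")   // assumed master URL
        .getOrCreate();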

    1. Writing the dataset as text -

dataset.select("column1").toDF().write().mode(SaveMode.Overwrite).text("/home/mountedLocation/TextOutput");

This creates the following folder structure -

/home/mountedLocation
    /TextOutput
        /_temporary/0/_temporary/attempt_<some_attempt_number_generated_by_spark>
        /part-00000-<job_id>-c000.txt
        /part-00001-<job_id>-c000.txt
        /part-00000-<job_id>-c000.txt.crc
        /part-00001-<job_id>-c000.txt.crc

The text files contain all the expected data; nothing is missing. However, no _SUCCESS file is generated, even after trying:

sparkContext.hadoopConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs","true");
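
For reference, a minimal sketch of where that setting would have to go: on the Hadoop configuration before the write action runs. The property name is the one from the post (it already defaults to true in Hadoop's FileOutputCommitter); the parquet source for "dataset" is a hypothetical stand-in:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

SparkSession ss = SparkSession.getActiveSession().get();

// The flag must be set on the Hadoop configuration *before* the write
// action is triggered; setting it afterwards has no effect on the job.
ss.sparkContext().hadoopConfiguration()
        .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "true");

Dataset<Row> dataset = ss.read().parquet("/home/input");  // hypothetical source
dataset.select("column1").toDF()
        .write()
        .mode(SaveMode.Overwrite)
        .text("/home/mountedLocation/TextOutput");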

Master logs:

INFO Master: Registering app appName
INFO Master: Registered app appName with ID app-20190606154601-0000
INFO Master: Launching executor app-20190606154601-0000/0 on worker worker-20190606154533-<worker1_ip>-36704
INFO Master: Launching executor app-20190606154601-0000/1 on worker worker-20190606154535-<worker2_ip>-39447
INFO Master: Received unregister request from application app-20190606154601-0000
INFO Master: Removing app app-20190606154601-0000
INFO Master: <localMachine_ip>:65087 got disassociated, removing it.
INFO Master: <localMachine_ip>:65078 got disassociated, removing it.
WARN Master: Got status update for unknown executor app-20190606154601-0000/0
WARN Master: Got status update for unknown executor app-20190606154601-0000/1

Worker1 logs:

INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.102-4.b14.el7.x86_64/bin/java" "-cp" "/home/sparkDirectory/conf/:/home/sparkDirectory/jars/*" "-Xmx1024M" "-Dspark.driver.port=65078" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@<master_ip>:65078" "--executor-id" "0" "--hostname" "<master_ip>" "--cores" "2" "--app-id" "app-20190606154601-0000" "--worker-url" "spark://Worker@<worker_ip>:36704"
INFO Worker: Asked to kill executor app-20190606154601-0000/0
INFO ExecutorRunner: Runner thread for executor app-20190606154601-0000/0 interrupted
INFO ExecutorRunner: Killing process!
INFO Worker: Executor app-20190606154601-0000/0 finished with state KILLED exitStatus 0
INFO ExternalShuffleBlockResolver: Clean up non-shuffle files associated with the finished executor 0
INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20190606154601-0000, execId=0)
INFO Worker: Cleaning up local directories for application app-20190606154601-0000
INFO ExternalShuffleBlockResolver: Application app-20190606154601-0000 removed, cleanupLocalDirs = true

stderr from the Spark UI:

INFO FileOutputCommitter: Saved output of task 'attempt_20190606154601_0000_m_000003_0' to file:/home/mountedLocation/TextOutput
INFO SparkHadoopMapRedUtil: attempt_20190606154601_0000_m_000003_0: Committed
INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 2032 bytes result sent to driver
INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

  2. Reading the generated .txt files -

Approach 1 -

SparkSession ss = SparkSession.getActiveSession().get();
Dataset<String> set = ss.read().textFile("/home/mountedLocation/TextOutput");
set.show();
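
Since the title asks for a JavaRDD, a small sketch of converting the Dataset<String> that textFile() returns; the part-* glob is an assumption, used to read only the part files and skip the _temporary directory:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

SparkSession ss = SparkSession.getActiveSession().get();
// Glob only the part files (assumption: avoids the _temporary leftovers).
Dataset<String> text = ss.read().textFile("/home/mountedLocation/TextOutput/part-*");
JavaRDD<String> rdd = text.javaRDD();  // Dataset<String> -> JavaRDD<String>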

Approach 2 -

JavaSparkContext sc = new JavaSparkContext(sparkContext);
JavaRDD<String> lines = sc.textFile(files);
for (String line : lines.collect()) {
    System.out.println(line);
}
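
A variant of the same read for diagnosis, assuming files points at the output directory from step 1; count() cheaply confirms whether the RDD is really empty, and take(n) samples a few lines instead of collect()ing everything onto the driver:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext(sparkContext);  // sparkContext as in the post
JavaRDD<String> lines = sc.textFile("/home/mountedLocation/TextOutput");
System.out.println("line count: " + lines.count());  // 0 confirms the empty read
for (String line : lines.take(20)) {                 // 20 is an arbitrary sample size
    System.out.println(line);
}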

Both approaches return an empty dataset.

Observations:

  • When the same code is run on the local Windows machine, the _temporary folder structure is not generated and the _SUCCESS file is created.
  • In that case the code is able to read all the text files into a single dataset and show() them (a sketch for making the path scheme explicit follows this list).
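
The executor stderr above shows the output committed to file:/home/mountedLocation/TextOutput, i.e. a local-filesystem path as seen from the workers. A hypothetical check when reading back, making the scheme explicit so the path is not resolved against a different default filesystem; the file:// URI is an assumption, not something from the post:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

SparkSession ss = SparkSession.getActiveSession().get();
// Explicit local-filesystem scheme; this only finds the files if they
// exist where the reading side (driver/executors) can actually see them.
Dataset<String> text = ss.read().textFile("file:///home/mountedLocation/TextOutput");
text.show(20, false);  // up to 20 rows, untruncated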

Can someone help me understand where the problem is and how to read the data back? All pointers and suggestions are welcome!

0 Answers:

There are no answers.