Scenario: Write a dataset as text files to a specified location on a standalone cluster, then read those files back into a JavaRDD.
Observations:
Environment setup:
The Java code is run with "spark-submit" from a local Windows machine through IntelliJ; the driver runs on the local machine.
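For reference, a minimal sketch of how the session is assumed to be created (the app name and master URL below are placeholders, not the exact values used):

// Build the session against the standalone master; values are placeholders.
SparkSession spark = SparkSession.builder()
        .appName("appName")
        .master("spark://<master_ip>:7077")
        .getOrCreate();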
dataset.select("column1").toDF().write().mode(SaveMode.Overwrite).text("/home/mountedLocation/TextOutput");
This creates the folder structure -
/home/mountedLocation
    /TextOutput
        /_temporary/0/_temporary/attempt_<some_attempt_number_generated_by_spark>
            /part-00000-<job_id>-c000.txt
            /part-00001-<job_id>-c000.txt
            /part-00000-<job_id>-c000.txt.crc
            /part-00001-<job_id>-c000.txt.crc
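To double-check what actually lands on disk from the driver side, a plain directory walk can be used (a java.nio sketch, not part of the original job):

import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

// Print every regular file under the output directory, including
// anything still sitting inside _temporary.
try (Stream<Path> paths = Files.walk(Paths.get("/home/mountedLocation/TextOutput"))) {
    paths.filter(Files::isRegularFile).forEach(System.out::println);
} catch (IOException e) {
    e.printStackTrace();
}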
The text files contain all the expected data; nothing is missing. However, the output is never committed out of _temporary and no _SUCCESS file appears, even after trying sparkContext.hadoopConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs","true");
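Spelled out as a block, the committer settings experimented with look like this (the algorithm.version line is an assumption added here; it is a standard Hadoop option that changes how task output is promoted out of _temporary, and was not in the original attempt):

// From the attempt above: ask the committer to write a _SUCCESS marker.
sparkContext.hadoopConfiguration()
        .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "true");
// Assumption, not in the original: switch the commit algorithm, which
// controls how part files are moved out of _temporary on job commit.
sparkContext.hadoopConfiguration()
        .set("mapreduce.fileoutputcommitter.algorithm.version", "2");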
Master logs:
INFO Master: Registering app appName
INFO Master: Registered app appName with ID app-20190606154601-0000
INFO Master: Launching executor app-20190606154601-0000/0 on worker worker-20190606154533-<worker1_ip>-36704
INFO Master: Launching executor app-20190606154601-0000/1 on worker worker-20190606154535-<worker2_ip>-39447
INFO Master: Received unregister request from application app-20190606154601-0000
INFO Master: Removing app app-20190606154601-0000
INFO Master: <localMachine_ip>:65087 got disassociated, removing it.
INFO Master: <localMachine_ip>:65078 got disassociated, removing it.
WARN Master: Got status update for unknown executor app-20190606154601-0000/0
WARN Master: Got status update for unknown executor app-20190606154601-0000/1
Worker1 logs:
INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.102-4.b14.el7.x86_64/bin/java" "-cp" "/home/sparkDirectory/conf/:/home/sparkDirectory/jars/*" "-Xmx1024M" "-Dspark.driver.port=65078" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@<master_ip>:65078" "--executor-id" "0" "--hostname" "<master_ip>" "--cores" "2" "--app-id" "app-20190606154601-0000" "--worker-url" "spark://Worker@<worker_ip>:36704"
INFO Worker: Asked to kill executor app-20190606154601-0000/0
INFO ExecutorRunner: Runner thread for executor app-20190606154601-0000/0 interrupted
INFO ExecutorRunner: Killing process!
INFO Worker: Executor app-20190606154601-0000/0 finished with state KILLED exitStatus 0
INFO ExternalShuffleBlockResolver: Clean up non-shuffle files associated with the finished executor 0
INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20190606154601-0000, execId=0)
INFO Worker: Cleaning up local directories for application app-20190606154601-0000
INFO ExternalShuffleBlockResolver: Application app-20190606154601-0000 removed, cleanupLocalDirs = true
stderr from the Spark UI:
INFO FileOutputCommitter: Saved output of task 'attempt_20190606154601_0000_m_000003_0' to file:/home/mountedLocation/TextOutput
INFO SparkHadoopMapRedUtil: attempt_20190606154601_0000_m_000003_0: Committed
INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 2032 bytes result sent to driver
INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
Approach 1 -
// textFile returns a Dataset<String> of the file lines
SparkSession ss = SparkSession.getActiveSession().get();
Dataset<String> set = ss.read().textFile("/home/mountedLocation/TextOutput");
set.show();
Approach 2 -
JavaSparkContext sc = new JavaSparkContext(sparkContext);
String files = "/home/mountedLocation/TextOutput"; // same path the job wrote to
JavaRDD<String> lines = sc.textFile(files);
for (String line : lines.collect()) {
    System.out.println(line);
}
Both approaches return an empty dataset.
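For what it's worth, since the part files sit under _temporary (Hadoop's input format treats names starting with an underscore as hidden, and directory listing is not recursive by default), a read pointed straight at them through a glob can at least confirm the data itself is readable. This is a sketch, not from the original code; it reuses the sc from Approach 2 and the paths shown above:

// Bypass the normal directory listing and glob the part files directly
// out of the _temporary tree that the write left behind.
JavaRDD<String> direct = sc.textFile(
        "/home/mountedLocation/TextOutput/_temporary/0/_temporary/*/part-*.txt");
System.out.println("lines found: " + direct.count());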
Could someone help me understand where the problem lies and how to read the data back? All pointers and suggestions are welcome!