Problem reading Parquet files when running Spark (Scala) on a cluster

Time: 2018-10-23 09:48:56

Tags: apache-spark

Hoping someone can help with an error we are running into.

Overview: our cluster is the Datalab cluster, and its users need to access the Juggernaut cluster (the data source).

We tried reading the data with each of the following calls:

    sqlContext.parquetFile("hdfs://juggernaut/data/dw/usa/cem_mbb/flowCell/date=20181009")
    sqlContext.read.load("hdfs://juggernaut/data/dw/usa/cem_mbb/flowCell/date=20181009")
    sqlContext.read.parquet("hdfs://juggernaut/data/dw/usa/cem_mbb/flow_cell/date=20181009")

  1. Running on local Spark: output OK
  2. Running Spark on Scala: hit the error log below

We keep getting the following error on Spark.

  

18/10/17 17:59:05 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, hdp007-r1.datalab.smart.local.ph): java.io.IOException: Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=hdfs://juggernaut/data/dw/usa/cem_mbb/flowCell/date=20181009/09-part-01-usa_cem_mbb_sdr_flowCell-20181009154019-r-00000.parquet; isDirectory=false; length=1070652650; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
    at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:786)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:775)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$anonfun$apply$22.apply(RDD.scala:717)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$anonfun$apply$22.apply(RDD.scala:717)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Could not read footer for file FileStatus{path=hdfs://juggernaut/data/dw/usa/cem_mbb/flowCell/date=20181009/09-part-01-usa_cem_mbb_sdr_flowCell-20181009154019-r-00000.parquet; isDirectory=false; length=1070652650; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
    at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:239)
    at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    ... 3 more
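
For context on what "Could not read footer" refers to: an unencrypted Parquet file starts with the 4-byte magic `PAR1` and ends with the footer metadata, a 4-byte little-endian footer length, and `PAR1` again. The sketch below is not part of the original post. It checks that trailer on a local copy of a file (for example, one fetched with `hdfs dfs -get`), and the names `ParquetFooterCheck` and `footerLength` are hypothetical. It can only indicate whether the file itself looks intact; it says nothing about connectivity or permissions between clusters.

```scala
import java.io.RandomAccessFile
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets
import java.util.Arrays

// Hypothetical helper for illustration; not from the original post.
object ParquetFooterCheck {
  private val Magic: Array[Byte] = "PAR1".getBytes(StandardCharsets.US_ASCII)

  /** Returns the footer length in bytes if the file's magic bytes and
    * trailer look valid, otherwise None. Works on a local file only. */
  def footerLength(path: String): Option[Int] = {
    val file = new RandomAccessFile(path, "r")
    try {
      val size = file.length()
      // Smallest possible layout: leading magic + footer length + trailing magic.
      if (size < 12) None
      else {
        val head = new Array[Byte](4)
        file.seek(0)
        file.readFully(head)

        // Last 8 bytes: 4-byte little-endian footer length, then the magic.
        val tail = new Array[Byte](8)
        file.seek(size - 8)
        file.readFully(tail)

        val len = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt
        val trailingMagic = Arrays.copyOfRange(tail, 4, 8)

        val valid =
          Arrays.equals(head, Magic) &&
          Arrays.equals(trailingMagic, Magic) &&
          len >= 0 && len.toLong <= size - 12 // footer must fit inside the file

        if (valid) Some(len) else None
      }
    } finally file.close()
  }
}
```

Calling `ParquetFooterCheck.footerLength("/tmp/part.parquet")` (a hypothetical local path) returns `Some(n)` when the trailer is well formed. If it does, the file is probably not corrupt, and the failure is more likely related to how the executors reach or authenticate against the remote cluster.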

0 answers