I am trying to read a JSON file and display its data in Spark using Scala. I can read the file successfully, but when I call dataframe.show() it throws an error. The code is below.
I have seen that reading multi-line JSON files became easier with this approach starting from Spark version 2.2.
import java.sql.{Date, Timestamp}
import java.text.SimpleDateFormat
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql._

object MostTrendingVideoOnADay {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)

    val spark = SparkSession
      .builder()
      .appName("youtube")
      .master("local[*]")
      .getOrCreate()

    val usCategory = spark.read
      .option("multiline", true)
      .option("mode", "PERMISSIVE")
      .json("G:/Apache Spark/DataSets/youtube/US_category_id.json")

    usCategory.printSchema()
    usCategory.show()

    spark.stop()
  }
}
The JSON file:
{
  "kind": "youtube#videoCategoryListResponse",
  "etag": "\"m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJvJAAShlR6hM\"",
  "items": [
    {
      "kind": "youtube#videoCategory",
      "etag": "\"m2yskBQFythfE4irbTIeOgYYfBU/Xy1mB4_yLrHy_BmKmPBggty2mZQ\"",
      "id": "1",
      "snippet": {
        "channelId": "UCBR8-60-B28hp2BmDPdntcQ",
        "title": "Film & Animation",
        "assignable": true
      }
    },
    {
      "kind": "youtube#videoCategory",
      "etag": "\"m2yskBQFythfE4irbTIeOgYYfBU/UZ1oLIIz2dxIhO45ZTFR3a3NyTA\"",
      "id": "2",
      "snippet": {
        "channelId": "UCBR8-60-B28hp2BmDPdntcQ",
        "title": "Autos & Vehicles",
        "assignable": true
      }
    }
  ]
}
The error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.io.FileNotFoundException: File file:/G:/Apache%20Spark/DataSets/youtube/US_category_id.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:245)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:637)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:596)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:605)
    at MostTrendingVideoOnADay$.main(MostTrendingVideoOnADay.scala:21)
    at MostTrendingVideoOnADay.main(MostTrendingVideoOnADay.scala)
Caused by: java.io.FileNotFoundException: File file:/G:/Apache%20Spark/DataSets/youtube/US_category_id.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Answer (score: 2)
As your log shows: java.io.FileNotFoundException: File file:/G:/Apache%20Spark/DataSets/youtube/US_category_id.json does not exist
You can see the Apache%20Spark in the path: there is a space in the directory name, and that space is causing the problem. Can you remove the space from the path, for example by renaming the folder to ApacheSpark
or Apache_Spark?
That should resolve the issue.
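For illustration, here is a minimal sketch of the read once the folder has been renamed; the Apache_Spark directory below is only an assumed example of the new space-free location, and everything else follows your original code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("youtube")
  .master("local[*]")
  .getOrCreate()

// Same reader options as before; the only change is the directory name,
// "Apache Spark" (with a space) renamed to "Apache_Spark" (no space, an
// assumed example), so nothing in the path gets URL-encoded to %20.
val usCategory = spark.read
  .option("multiline", true)
  .option("mode", "PERMISSIVE")
  .json("G:/Apache_Spark/DataSets/youtube/US_category_id.json")

usCategory.printSchema()
usCategory.show()

The rename matters because the space appears to be encoded as %20 when Spark converts the path to a URI, and this code path then looks for the encoded name literally on disk, which is why the FileNotFoundException complains about a file that actually exists.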
Hope this helps!