I am trying to read a JSON file and display its data in Spark using Scala. I can read the file successfully, but when I call dataframe.show() it throws an error. The code is below.
I have seen that reading multi-line JSON files became easier with this approach starting from Spark version 2.2.
import java.sql.{Date, Timestamp}
import java.text.SimpleDateFormat
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql._

object MostTrendingVideoOnADay {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)

    val spark = SparkSession
      .builder()
      .appName("youtube")
      .master("local[*]")
      .getOrCreate()

    val usCategory = spark.read
      .option("multiline", true)
      .option("mode", "PERMISSIVE")
      .json("G:/Apache Spark/DataSets/youtube/US_category_id.json")

    usCategory.printSchema()
    usCategory.show()

    spark.stop()
  }
}
The JSON file:
{
  "kind": "youtube#videoCategoryListResponse",
  "etag": "\"m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJvJAAShlR6hM\"",
  "items": [
    {
      "kind": "youtube#videoCategory",
      "etag": "\"m2yskBQFythfE4irbTIeOgYYfBU/Xy1mB4_yLrHy_BmKmPBggty2mZQ\"",
      "id": "1",
      "snippet": {
        "channelId": "UCBR8-60-B28hp2BmDPdntcQ",
        "title": "Film & Animation",
        "assignable": true
      }
    },
    {
      "kind": "youtube#videoCategory",
      "etag": "\"m2yskBQFythfE4irbTIeOgYYfBU/UZ1oLIIz2dxIhO45ZTFR3a3NyTA\"",
      "id": "2",
      "snippet": {
        "channelId": "UCBR8-60-B28hp2BmDPdntcQ",
        "title": "Autos & Vehicles",
        "assignable": true
      }
    }
  ]
}
The error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.io.FileNotFoundException: File file:/G:/Apache%20Spark/DataSets/youtube/US_category_id.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:245)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:637)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:596)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:605)
    at MostTrendingVideoOnADay$.main(MostTrendingVideoOnADay.scala:21)
    at MostTrendingVideoOnADay.main(MostTrendingVideoOnADay.scala)
Caused by: java.io.FileNotFoundException: File file:/G:/Apache%20Spark/DataSets/youtube/US_category_id.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Answer (score: 2)
As your log shows: java.io.FileNotFoundException: File file:/G:/Apache%20Spark/DataSets/youtube/US_category_id.json does not exist
You can see the Apache%20Spark in the path: there is a space in the directory name, and that space is causing the problem. Can you remove the space from the path, for example by renaming the folder to ApacheSpark
or Apache_Spark?
That should resolve the issue.
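For illustration, here is a minimal sketch of the read once the folder has been renamed; the Apache_Spark directory below is only an assumed example of the new space-free location, and everything else follows your original code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("youtube")
  .master("local[*]")
  .getOrCreate()

// Same reader options as before; the only change is the directory name,
// "Apache Spark" (with a space) renamed to "Apache_Spark" (no space, an
// assumed example), so nothing in the path gets URL-encoded to %20.
val usCategory = spark.read
  .option("multiline", true)
  .option("mode", "PERMISSIVE")
  .json("G:/Apache_Spark/DataSets/youtube/US_category_id.json")

usCategory.printSchema()
usCategory.show()

The rename matters because the space appears to be encoded as %20 when Spark converts the path to a URI, and this code path then looks for the encoded name literally on disk, which is why the FileNotFoundException complains about a file that actually exists.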
Hope this helps!