Spark Scala: read and show a multi-line JSON file as a DataFrame

Time: 2018-02-13 16:36:30

Tags: json scala apache-spark apache-spark-sql

I am trying to read a JSON file and display its data in Spark using Scala. I read the file successfully, but it throws an error when I call dataframe.show(). The code is below.

I see that reading multi-line JSON files became easier starting with Spark 2.2 using this approach.

import java.sql.{Date, Timestamp}
import java.text.SimpleDateFormat

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql._

object MostTrendingVideoOnADay {

  def main(args: Array[String]): Unit = {
    // Silence Spark's INFO/WARN logging.
    Logger.getLogger("org").setLevel(Level.OFF)

    val spark = SparkSession
      .builder()
      .appName("youtube")
      .master("local[*]")
      .getOrCreate()

    // Read the whole file as a single multi-line JSON document (Spark 2.2+).
    val usCategory = spark.read
      .option("multiLine", true)
      .option("mode", "PERMISSIVE")
      .json("G:/Apache Spark/DataSets/youtube/US_category_id.json")

    usCategory.printSchema()
    usCategory.show()

    spark.stop()
  }
}
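As a side note: without the multiLine option, Spark's JSON source expects one complete JSON document per line, so a pretty-printed file like the one shown below would typically parse into a single _corrupt_record column under PERMISSIVE mode. A minimal sketch of the difference, reusing the spark session above and a hypothetical space-free copy of the file:

// Sketch only; the path is hypothetical (a space-free copy of the dataset).
val path = "G:/ApacheSpark/DataSets/youtube/US_category_id.json"

// Default line-delimited mode: each input line must be a complete JSON document,
// so a pretty-printed file usually ends up in a single _corrupt_record column.
val lineDelimited = spark.read.option("mode", "PERMISSIVE").json(path)
lineDelimited.printSchema()

// multiLine mode (available since Spark 2.2): the whole file can be one document.
val multiLine = spark.read.option("multiLine", true).json(path)
multiLine.printSchema()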

The JSON file:

{
  "kind": "youtube#videoCategoryListResponse",
  "etag": "\"m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJvJAAShlR6hM\"",
  "items": [
    {
      "kind": "youtube#videoCategory",
      "etag": "\"m2yskBQFythfE4irbTIeOgYYfBU/Xy1mB4_yLrHy_BmKmPBggty2mZQ\"",
      "id": "1",
      "snippet": {
        "channelId": "UCBR8-60-B28hp2BmDPdntcQ",
        "title": "Film & Animation",
        "assignable": true
      }
    },
    {
      "kind": "youtube#videoCategory",
      "etag": "\"m2yskBQFythfE4irbTIeOgYYfBU/UZ1oLIIz2dxIhO45ZTFR3a3NyTA\"",
      "id": "2",
      "snippet": {
        "channelId": "UCBR8-60-B28hp2BmDPdntcQ",
        "title": "Autos & Vehicles",
        "assignable": true
      }
    }
  ]
}

The error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.io.FileNotFoundException: File file:/G:/Apache%20Spark/DataSets/youtube/US_category_id.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:245)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:637)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:596)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:605)
    at MostTrendingVideoOnADay$.main(MostTrendingVideoOnADay.scala:21)
    at MostTrendingVideoOnADay.main(MostTrendingVideoOnADay.scala)
Caused by: java.io.FileNotFoundException: File file:/G:/Apache%20Spark/DataSets/youtube/US_category_id.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

1 answer:

Answer 0 (score: 2)

As shown in your log:

java.io.FileNotFoundException: File file:/G:/Apache%20Spark/DataSets/youtube/US_category_id.json does not exist

you can see the path contains a space (Apache Spark), which is causing the problem: Spark encodes the space as %20, as visible in the exception, and then fails to find that encoded path on disk. Could you remove the space from the path? Something like ApacheSpark or Apache_Spark should solve the problem.
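For illustration, here is a minimal sketch of the corrected read, assuming the folder is renamed to a hypothetical space-free path:

// Hypothetical path after renaming the "Apache Spark" folder to "ApacheSpark".
val usCategory = spark.read
  .option("multiLine", true)
  .option("mode", "PERMISSIVE")
  .json("G:/ApacheSpark/DataSets/youtube/US_category_id.json")

usCategory.printSchema()
usCategory.show()

Renaming the directory avoids the encoding issue entirely; no other code changes should be needed.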

Hope this helps!