Spark -scala从json文件创建DataFrame并删除基本和嵌套元素

时间:2018-10-30 11:42:21

标签: json scala apache-spark

我有一个包含json对象的json文件,每个对象逐行显示。 我有以下对象的架构:

root
   |-- endtime: long (nullable = true)
   |-- result: array (nullable = true)
   |    |-- element: struct (containsNull = true)
   |    |    |-- hop: long (nullable = true)
   |    |    |-- result: array (nullable = true)
   |    |    |    |-- element: struct (containsNull = true)
   |    |    |    |    |-- from: string (nullable = true)
   |    |    |    |    |-- rtt: double (nullable = true)
   |    |    |    |    |-- size: long (nullable = true)
   |    |    |    |    |-- ttl: long (nullable = true)
   |    |    |    |    |-- x: string (nullable = true)

问题:如何从包含以输入形式提供的json文件中的数据的数据框创建新的数据框,并删除ttl和x的数据?

   |    |    |    |    |-- ttl: long (nullable = true)
   |    |    |    |    |-- x: string (nullable = true)

鉴于我是Spark(Scala)的新手,我不知道有什么可能的方式!

删除结束时间很简单:

val pathToTraceroutesExamples = getClass.getResource("/test/sample_1.json")
val df = spark.read.json(pathToTraceroutesExamples.getPath)

// Displays the content of the DataFrame to stdout
df.show()
df.printSchema()

var newDf = df.drop("endtime")

2 个答案:

答案 0 :(得分:1)

explodedrop可以解决问题。首先,根据结果数据帧explode进行第一级结果,然后explode进行第二级结果。最后drop列。

例如,

val newDF = df
  .select(df(“*”), explode(df(“result”)).alias(“result_exp”))
  .drop(“ttl”).drop(“x”)

答案 1 :(得分:0)

@Kris的想法是正确的;爆炸然后掉落。我找到了一个示例here

我更改了属性名称结果,因为我有另一个结果名称以避免爆炸时的混乱:

步骤1 :(输入)

 |-- timestamp: long (nullable = true)
 |-- hopDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- hop: long (nullable = true)
 |    |    |-- result: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- from: string (nullable = true)
 |    |    |    |    |-- rtt: double (nullable = true)
 |    |    |    |    |-- size: long (nullable = true)
 |    |    |    |    |-- ttl: long (nullable = true)

步骤2: 代码:

    var exploded_1 = renamed_newDF
             .withColumn("hop", explode(renamed_newDF("hopDetails.hop")))
             .withColumn("result", explode(renamed_newDF("hopDetails.result")))
             .drop("hopDetails")
    exploded_1.printSchema

输出架构:

 |-- timestamp: long (nullable = true)
 |-- hop: long (nullable = true)
 |-- result: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- from: string (nullable = true)
 |    |    |-- rtt: double (nullable = true)
 |    |    |-- size: long (nullable = true)
 |    |    |-- ttl: long (nullable = true)

第3步:

代码:

var exploded_2 = exploded_1
  .withColumn("from", explode(exploded_1("result.from")))
  .withColumn("rtt", explode(exploded_1("result.rtt")))
  .withColumn("size", explode(exploded_1("result.size")))
  .withColumn("ttl", explode(exploded_1("result.ttl")))
  .drop("result")

exploded_2.printSchema

模式:

    root
   |-- af: long (nullable = true)
   |-- dst_addr: string (nullable = true)
   |-- from: string (nullable = true)
   |-- msm_id: long (nullable = true)
   |-- prb_id: long (nullable = true)
   |-- src_addr: string (nullable = true)
   |-- timestamp: long (nullable = true)
   |-- hop: long (nullable = true)
   |-- rtt: double (nullable = true)
   |-- size: long (nullable = true)
   |-- ttl: long (nullable = true)