I have a JSON file containing JSON objects, one object per line. The objects have the following schema:
root
|-- endtime: long (nullable = true)
|-- result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- hop: long (nullable = true)
| | |-- result: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- from: string (nullable = true)
| | | | |-- rtt: double (nullable = true)
| | | | |-- size: long (nullable = true)
| | | | |-- ttl: long (nullable = true)
| | | | |-- x: string (nullable = true)
Question: how can I create a new DataFrame from the DataFrame that holds the data read from the input JSON file, with the data for ttl and x removed?
| | | | |-- ttl: long (nullable = true)
| | | | |-- x: string (nullable = true)
Since I am new to Spark (Scala), I don't know what approaches are possible.
Dropping endtime is easy:
val pathToTraceroutesExamples = getClass.getResource("/test/sample_1.json")
val df = spark.read.json(pathToTraceroutesExamples.getPath)
// Displays the content of the DataFrame to stdout
df.show()
df.printSchema()
var newDf = df.drop("endtime")
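Answer 0 (score: 1)
explode and drop can solve the problem. First explode the first-level result array of the DataFrame, then explode the second-level result array. Finally, drop the unwanted columns.
For example: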
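import org.apache.spark.sql.functions.{col, explode}
val newDF = df
  .withColumn("hop_exp", explode(col("result")))        // one row per hop
  .withColumn("reply", explode(col("hop_exp.result")))  // one row per reply
  .select(col("endtime"), col("hop_exp.hop"), col("reply.*"))
  .drop("ttl", "x")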
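A self-contained sketch of the explode-then-drop approach described above, written so it can be pasted into spark-shell or run as a script. The sample record and the names hop_exp, reply and flat are assumptions made for illustration; only the field names come from the schema in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Local session for the sketch; in spark-shell getOrCreate() reuses the existing one
val spark = SparkSession.builder().master("local[1]").appName("explode-then-drop").getOrCreate()
import spark.implicits._

// Hypothetical sample: one traceroute with one hop and two replies
val sample = Seq(
  """{"endtime":1,"result":[{"hop":1,"result":[""" +
  """{"from":"10.0.0.1","rtt":1.5,"size":28,"ttl":64,"x":"a"},""" +
  """{"from":"10.0.0.1","rtt":1.7,"size":28,"ttl":64,"x":"b"}]}]}"""
).toDS()
val df = spark.read.json(sample)

val flat = df
  .withColumn("hop_exp", explode(col("result")))        // first level: one row per hop
  .withColumn("reply", explode(col("hop_exp.result")))  // second level: one row per reply
  .select(col("endtime"), col("hop_exp.hop"), col("reply.*")) // promote struct fields to columns
  .drop("ttl", "x")                                     // now ttl and x are top-level and can be dropped

flat.printSchema()
```

The drop only has an effect once ttl and x are top-level columns, which is why the struct fields are promoted with reply.* first.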
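Answer 1 (score: 0)
@Kris has the right idea: explode, then drop. I found an example here.
I renamed the outer result attribute, because there is another attribute named result inside it and I wanted to avoid confusion when exploding:
Step 1 (input):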
|-- timestamp: long (nullable = true)
|-- hopDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- hop: long (nullable = true)
| | |-- result: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- from: string (nullable = true)
| | | | |-- rtt: double (nullable = true)
| | | | |-- size: long (nullable = true)
| | | | |-- ttl: long (nullable = true)
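Step 2 (code):
import org.apache.spark.sql.functions.{col, explode}
// Explode hopDetails once, so that each hop stays paired with its own result array
val exploded_1 = renamed_newDF
  .withColumn("hopDetail", explode(col("hopDetails")))
  .withColumn("hop", col("hopDetail.hop"))
  .withColumn("result", col("hopDetail.result"))
  .drop("hopDetails", "hopDetail")
exploded_1.printSchema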
Output schema:
|-- timestamp: long (nullable = true)
|-- hop: long (nullable = true)
|-- result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- from: string (nullable = true)
| | |-- rtt: double (nullable = true)
| | |-- size: long (nullable = true)
| | |-- ttl: long (nullable = true)
Step 3:
Code:
import org.apache.spark.sql.functions.{col, explode}
// Explode result once, so that from, rtt, size and ttl of the same reply stay on one row
val exploded_2 = exploded_1
  .withColumn("reply", explode(col("result")))
  .select(col("*"), col("reply.*"))
  .drop("result", "reply")
exploded_2.printSchema
Schema:
root
|-- af: long (nullable = true)
|-- dst_addr: string (nullable = true)
|-- from: string (nullable = true)
|-- msm_id: long (nullable = true)
|-- prb_id: long (nullable = true)
|-- src_addr: string (nullable = true)
|-- timestamp: long (nullable = true)
|-- hop: long (nullable = true)
|-- rtt: double (nullable = true)
|-- size: long (nullable = true)
|-- ttl: long (nullable = true)
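Note that the flattened schema above still lists ttl, so a final drop("ttl") on the flattened DataFrame would be needed to remove it. If flattening is not wanted at all, the nested fields can probably be removed in place instead. The following is a minimal sketch of that alternative, assuming Spark 3.1 or later (for Column.withField and Column.dropFields, together with the transform function); the sample record and the name trimmed are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, transform}

// Local session for the sketch; in spark-shell getOrCreate() reuses the existing one
val spark = SparkSession.builder().master("local[1]").appName("drop-nested-fields").getOrCreate()
import spark.implicits._

// Hypothetical sample following the schema from the question
val sample = Seq(
  """{"endtime":1,"result":[{"hop":1,"result":[""" +
  """{"from":"10.0.0.1","rtt":1.5,"size":28,"ttl":64,"x":"a"}]}]}"""
).toDS()
val df = spark.read.json(sample)

// Rewrite every hop struct: replace its inner result array with a copy
// whose elements no longer carry the ttl and x fields
val trimmed = df.withColumn(
  "result",
  transform(col("result"), hop =>
    hop.withField(
      "result",
      transform(hop.getField("result"), reply => reply.dropFields("ttl", "x"))
    )
  )
)

trimmed.printSchema()
```

This keeps one row per input object and the original nesting, at the cost of requiring a newer Spark version than the explode-based steps.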