I have a dataset with the following schema:
root
|-- collectorId: string (nullable = true)
|-- generatedAt: long (nullable = true)
|-- managedNeId: string (nullable = true)
|-- neAlert: struct (nullable = true)
| |-- advisory: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- equipmentType: string (nullable = true)
| | | |-- headlineName: string (nullable = true)
| |-- fieldNotice: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- caveat: string (nullable = true)
| | | |-- distributionCode: string (nullable = true)
| |-- hwEoX: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- bulletinName: string (nullable = true)
| | | |-- equipmentType: string (nullable = true)
| |-- swEoX: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- bulletinHeadline: string (nullable = true)
| | | |-- equipmentType: string (nullable = true)
|-- partyId: string (nullable = true)
|-- recordType: string (nullable = true)
|-- sourceNeId: string (nullable = true)
|-- sourcePartyId: string (nullable = true)
|-- sourceSubPartyId: string (nullable = true)
|-- wfid: string (nullable = true)
I want to get at the fields inside "element". To do that, I did an explode on the arrays to flatten the structure.
Dataset<Row> alert = spark.read()
        .option("multiLine", true)
        .option("mode", "PERMISSIVE")
        .json("C:\\Users\\LearningAndDevelopment\\\\merge\\data1\\sample.json");
Seq<String> droppedColumns = scala.collection.JavaConversions.asScalaBuffer(Arrays.asList("neAlert"));
Dataset<Row> alertjson = alert
        .withColumn("exploded_advisory", explode(col("neAlert.advisory")))
        .withColumn("exploded_fn", explode(col("neAlert.fieldNotice")))
        .withColumn("exploded_swEoX", explode(col("neAlert.swEoX")))
        .withColumn("exploded_hwEox", explode(col("neAlert.hwEoX")))
        .drop(droppedColumns);
alertjson.printSchema();
The resulting schema looks like this:
root
|-- collectorId: string (nullable = true)
|-- generatedAt: long (nullable = true)
|-- managedNeId: string (nullable = true)
|-- partyId: string (nullable = true)
|-- recordType: string (nullable = true)
|-- sourceNeId: string (nullable = true)
|-- sourcePartyId: string (nullable = true)
|-- sourceSubPartyId: string (nullable = true)
|-- wfid: string (nullable = true)
|-- exploded_advisory: struct (nullable = true)
| |-- equipmentType: string (nullable = true)
| |-- headlineName: string (nullable = true)
|-- exploded_fn: struct (nullable = true)
| |-- caveat: string (nullable = true)
| |-- distributionCode: string (nullable = true)
|-- exploded_swEoX: struct (nullable = true)
| |-- bulletinHeadline: string (nullable = true)
| |-- equipmentType: string (nullable = true)
|-- exploded_hwEox: struct (nullable = true)
| |-- bulletinName: string (nullable = true)
| |-- equipmentType: string (nullable = true)
However, the approach above produces duplicate records, each flattened with data from only the first element of every JSON array, and each array can contain many elements. How can I flatten the JSON arrays without losing data integrity?
Answer 0 (score: 1)
You can first select the nested JSON with the dot operator `.`, and then call explode on each nested field:
Dataset<Row> alertjson = alert
.withColumn("exploded_advisory", explode(col("neAlert.advisory")))
.withColumn("exploded_fn", explode(col("neAlert.fieldNotice")))
.withColumn("exploded_swEoX", explode(col("neAlert.swEoX")))
.withColumn("exploded_hwEox", explode(col("neAlert.hwEoX")));
If you want each exploded field in its own dataset, you have to explode them separately, which creates multiple dataframes:
// for advisory
Dataset<Row> advisory = alert
    .withColumn("exploded_advisory", explode(col("neAlert.advisory")));

// for fieldNotice
Dataset<Row> fieldNotice = alert
    .withColumn("exploded_fn", explode(col("neAlert.fieldNotice")));
Drop the columns you do not need, and it should work fine.
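For illustration, here is a minimal sketch (column and field names are assumed from the schema shown in the question, not taken from the original answer) of flattening one array, expanding the struct fields into top-level columns, and dropping the nested column:

// Minimal sketch, assuming the schema shown above.
// Assumes: import static org.apache.spark.sql.functions.*;
Dataset<Row> advisoryFlat = alert
        .withColumn("exploded_advisory", explode(col("neAlert.advisory")))
        .drop("neAlert");

// "exploded_advisory.*" expands the struct into top-level columns
Dataset<Row> advisoryFields = advisoryFlat.selectExpr(
        "collectorId", "managedNeId", "partyId", "exploded_advisory.*");
advisoryFields.printSchema();

Each separately exploded dataframe then holds one row per array element, which avoids the cross product you get when all four arrays are exploded in the same dataframe.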