I am flattening the following JSON, which is loaded into a Spark DataFrame, using the code shown in the Code section.
JSON:
{
"id":"B07H3MVTSN",
"mid":4,
"inner":{
"type1":[{
"cid":"B06XVVSLX8"
},
{
"cid":"B06XJ2JZ2Z"
}
]
}
}
Code:
df
  .withColumn("cid", org.apache.spark.sql.functions.explode(df.col("inner.type1")).as("cid"))
  .drop("inner")
  .show()
It produces the following output.
Output:
+----------+---+------------+
|        id|mid|         cid|
+----------+---+------------+
|B07H3MVTSN|  4|[B06XVVSLX8]|
|B07H3MVTSN|  4|[B06XJ2JZ2Z]|
+----------+---+------------+
The explode function wraps each element of the cid column in []. I only want the string inside the []. How can I remove the []?
If I print the schema, it shows that the cid column is a struct:
root
 |-- id: string (nullable = true)
 |-- mid: long (nullable = true)
 |-- cid: struct (nullable = true)
 |    |-- cid: string (nullable = true)
How can I convert the value from a struct to a string, so that cid appears in the schema as a plain string column?
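Answer 0 (score: 2)
Check the code below. Selecting inner.type1.cid on an array of structs yields an array of strings, so explode returns plain strings instead of single-field structs.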
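import org.apache.spark.sql.functions.explode
import spark.implicits._  // enables the $"..." column syntax; assumes a SparkSession named spark

df
  .withColumn("cid", explode($"inner.type1.cid"))
  .drop("inner")
  .show(false)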
+----------+---+----------+
|id |mid|cid |
+----------+---+----------+
|B07H3MVTSN|4 |B06XVVSLX8|
|B07H3MVTSN|4 |B06XJ2JZ2Z|
+----------+---+----------+
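
If you would rather keep the original explode on inner.type1, an alternative is to explode the structs first and then pull the string field out of each one. The following is only a minimal, self-contained sketch: the object name FlattenCid, the local SparkSession settings, and the temporary column name item are assumptions made for illustration and are not part of the question. It also assumes Spark 2.2 or later, where spark.read.json accepts a Dataset[String].

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object FlattenCid {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here only so the sketch runs on its own.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-cid")
      .getOrCreate()
    import spark.implicits._

    // The sample record from the question, written as a single-line JSON string.
    val json =
      """{"id":"B07H3MVTSN","mid":4,"inner":{"type1":[{"cid":"B06XVVSLX8"},{"cid":"B06XJ2JZ2Z"}]}}"""
    val df = spark.read.json(Seq(json).toDS())

    val flat = df
      // One row per array element; each "item" is a struct with a single cid field.
      .withColumn("item", explode(col("inner.type1")))
      // Extract the string field from the struct.
      .withColumn("cid", col("item.cid"))
      // Remove the nested source column and the temporary struct column.
      .drop("inner", "item")

    flat.printSchema()
    flat.show(false)

    spark.stop()
  }
}
```

This form is handy when the structs carry more than one field, because each field can be read from item before it is dropped. For a single field, the shorter explode($"inner.type1.cid") from the answer above does the same job.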