Remove square brackets when using Spark explode

Time: 2021-07-03 01:23:39

Tags: apache-spark apache-spark-sql

I am flattening the following JSON, loaded into a Spark DataFrame, using the code shown in the Code section.

{
    "id":"B07H3MVTSN",
    "mid":4,
    "inner":{
      "type1":[{
          "cid":"B06XVVSLX8"
        },
        {
          "cid":"B06XJ2JZ2Z"
        }
      ]
    }
  }

Code

df
.withColumn("cid", org.apache.spark.sql.functions.explode(df.col("inner.type1")).as("cid"))
.drop("inner").show;

It produces the following output.

Output:

+----------+--------------+------------+
|   id     |mid           |         cid|
+----------+--------------+------------+
|B07H3MVTSN|             4|[B06XVVSLX8]|
|B07H3MVTSN|             4|[B06XJ2JZ2Z]|
+----------+--------------+------------+

The explode function adds [] around each element of the cid column. I only want the string inside the []. How can I remove the []?

If I print the schema, it shows that the cid column is a struct.

root
 |-- id: string (nullable = true)
 |-- mid: long (nullable = true)
 |-- cid: struct (nullable = true)
 |    |-- cid: string (nullable = true)

How can I convert the value from a struct to a string, so that the schema is

root
 |-- id: string (nullable = true)
 |-- mid: long (nullable = true)
 |-- cid: string (nullable = true)

1 Answer:

Answer 0: (score: 2)

Check the code below.

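// Needs import spark.implicits._ (for $"...", where spark is your SparkSession)
// and import org.apache.spark.sql.functions.explode
// inner.type1.cid picks the cid field out of every struct, giving an array of strings to explode.
df
.withColumn("cid", explode($"inner.type1.cid"))
.drop("inner")
.show(false)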
+----------+---+----------+
|id        |mid|cid       |
+----------+---+----------+
|B07H3MVTSN|4  |B06XVVSLX8|
|B07H3MVTSN|4  |B06XJ2JZ2Z|
+----------+---+----------+
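
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the answer's approach. It is not from the original post: the object name ExplodeCidSketch, the local[1] master, the app name, and the intermediate column name item are all assumptions made for illustration. It also shows an alternative that should be equivalent: explode the structs first, then read the cid field out of each one.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.StringType

// Hypothetical wrapper object; only the explode calls mirror the post.
object ExplodeCidSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (master and app name are assumptions).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explode-cid-sketch")
      .getOrCreate()
    import spark.implicits._

    // The JSON from the question, written as a single-line record.
    val json =
      """{"id":"B07H3MVTSN","mid":4,"inner":{"type1":[{"cid":"B06XVVSLX8"},{"cid":"B06XJ2JZ2Z"}]}}"""
    val df = spark.read.json(Seq(json).toDS())

    // Approach from the answer: select the field first, then explode.
    // inner.type1.cid is an array of strings, so each exploded value is a string.
    val projected = df
      .withColumn("cid", explode(col("inner.type1.cid")))
      .drop("inner")

    // Alternative: explode the structs, then pull cid out of each struct.
    // "item" is a hypothetical temporary column name.
    val unwrapped = df
      .withColumn("item", explode(col("inner.type1")))
      .withColumn("cid", col("item.cid"))
      .drop("inner", "item")

    // Both variants should give a plain string column with the same values.
    assert(projected.schema("cid").dataType == StringType)
    assert(unwrapped.schema("cid").dataType == StringType)
    val a = projected.select("cid").as[String].collect().sorted.toSeq
    val b = unwrapped.select("cid").as[String].collect().sorted.toSeq
    assert(a == b)

    projected.printSchema()
    projected.show(false)

    spark.stop()
  }
}
```

The first form is shorter. The second may be handier when the struct has several fields and you want more than one of them per exploded row.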