Question

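I am currently using Apache Spark 2.1.1 to process XML files into CSV. My goal is to flatten the XML, but the problem I am facing is elements with unbounded occurrences. Spark automatically infers these unbounded occurrences as arrays. What I want to do now is explode an array column.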

 Sample Schema

 |-- Instrument_XREF_Identifier: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- @bsid: string (nullable = true)
 |    |    |-- @exch_code: string (nullable = true)
 |    |    |-- @id_bb_sec_num: string (nullable = true)
 |    |    |-- @market_sector: string (nullable = true)

I know I can explode the array with this approach:

result = result.withColumn(p.name, explode(col(p.name)))
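For context, the line above presumably runs inside a loop over the schema fields, with `p` being a `StructField`. That is an assumption, since the question does not show the surrounding code. A minimal Scala sketch of such a loop follows; the names `ExplodeArrays` and `explodeAllArrays` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.ArrayType

object ExplodeArrays {
  // Assumption: `p` in the question is a StructField taken from the DataFrame schema.
  // Every top-level array column is replaced by its exploded elements, one row per element.
  def explodeAllArrays(df: DataFrame): DataFrame =
    df.schema.fields
      .filter(_.dataType.isInstanceOf[ArrayType])
      .foldLeft(df) { (acc, p) => acc.withColumn(p.name, explode(col(p.name))) }
}
```

Note that exploding several array columns in sequence multiplies the row count, and `explode` drops rows whose array is null or empty.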

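This produces multiple rows, one per array element, each containing a struct. But the output I want splits the array into multiple columns rather than rows.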

Based on the schema above, this is my expected output:

Assume the array contains two struct values.

bsid1   exch_code1   id_bb_sec_num1   market_sector1   bsid2   exch_code2   id_bb_sec_num2   market_sector2
123     3            1                13               234     12           212              221

Answer 1

Assuming Instrument_XREF_Identifier is a column of type array<struct<..>>, you have to do it in two steps:

result
.withColumn("tmp",explode(col("Instrument_XREF_Identifier")))
.select("tmp.*")

This gives you one column for each field of the struct, with one row per array element.
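The two steps can be tried end to end on a small in-memory DataFrame. The sketch below is illustrative only: the `Xref` case class, the `ExplodeStructs` object, and the sample values are assumptions, and the field names omit the `@` prefix that appears in the question's schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical stand-in for the struct in the question (no "@" prefix on field names).
case class Xref(bsid: String, exch_code: String, id_bb_sec_num: String, market_sector: String)

object ExplodeStructs {
  // One input row whose array column holds two structs.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(Tuple1(Seq(Xref("123", "3", "1", "13"), Xref("234", "12", "212", "221"))))
      .toDF("Instrument_XREF_Identifier")
  }

  // Step 1: explode the array into one row per struct.
  // Step 2: expand the struct's fields into top-level columns.
  def flatten(df: DataFrame): DataFrame =
    df.withColumn("tmp", explode(col("Instrument_XREF_Identifier")))
      .select("tmp.*")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("explode-structs").getOrCreate()
    flatten(sample(spark)).show()
    spark.stop()
  }
}
```

With this sample input the result has two rows and four columns, which is still the long layout rather than the wide layout the question asks for.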

There seems to be no way to do this in a single select / withColumn statement; see "Explode array of structs to columns in Spark".
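For the wide layout in the question (bsid1, exch_code1, ..., bsid2, ...), one possible approach, which is not part of the answer above, is to skip `explode` and index into the array directly. The sketch below assumes the maximum array length is known up front and passed in as `maxLen`; `WidenArray` and `widen` are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, StructType}

object WidenArray {
  // Builds one column per (array index, struct field) pair, named <field><index + 1>.
  // Assumption: maxLen is supplied by the caller; shorter arrays yield nulls.
  def widen(df: DataFrame, arrayCol: String, maxLen: Int): DataFrame = {
    val fieldNames = df.schema(arrayCol).dataType match {
      case ArrayType(st: StructType, _) => st.fieldNames.toSeq
      case other =>
        throw new IllegalArgumentException(s"$arrayCol is not array<struct>: $other")
    }
    val cols = for {
      i <- 0 until maxLen
      f <- fieldNames
    } yield col(arrayCol).getItem(i).getField(f).as(s"${f.stripPrefix("@")}${i + 1}")
    // Only the widened columns are kept; other columns of df are dropped here.
    df.select(cols: _*)
  }
}
```

A call such as `WidenArray.widen(result, "Instrument_XREF_Identifier", 2)` would produce one row per input row. If the maximum length is not known, it could be computed first with an aggregation over `size(col(arrayCol))`, at the cost of an extra pass over the data.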

Apache Spark DataFrame column explode to multiple columns
