Given a dataframe with the following format:
{
"field1": "value1",
"field2": "value2",
"elements": [{
"id": "1",
"name": "a"
},
{
"id": "2",
"name": "b"
},
{
"id": "3",
"name": "c"
}]
}
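For reference, a dataframe with this schema can be built by reading the JSON directly. This is a minimal sketch, assuming a Spark 2.x `SparkSession` named `spark`; the `elements` field is inferred as an array of structs:

```scala
import spark.implicits._

// Parse the JSON document shown above into a DataFrame
val df = spark.read.json(Seq(
  """{"field1":"value1","field2":"value2",
     |"elements":[{"id":"1","name":"a"},{"id":"2","name":"b"},{"id":"3","name":"c"}]}""".stripMargin
).toDS)
```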
We can flatten the columns like this:
import org.apache.spark.sql.functions.explode
val exploded = df.withColumn("elements", explode($"elements"))
exploded.show()
>> +--------+------+------+
>> |elements|field1|field2|
>> +--------+------+------+
>> | [1,a]|value1|value2|
>> | [2,b]|value1|value2|
>> | [3,c]|value1|value2|
>> +--------+------+------+
val flattened = exploded.select("elements.*", "field1", "field2")
flattened.show()
>> +---+----+------+------+
>> | id|name|field1|field2|
>> +---+----+------+------+
>> | 1| a|value1|value2|
>> | 2| b|value1|value2|
>> | 3| c|value1|value2|
>> +---+----+------+------+
Is there a way to get the flattened dataframe without explicitly listing the remaining columns? Something like this (although this doesn't work)?
val flattened = exploded.select("elements.*", "*")
Answer 0 (score: 1)
Yes, you can query the columns of `exploded` and then select all of them except `elements`:
import org.apache.spark.sql.functions.col
// Every column except the struct column, converted to Column objects
val colsToSelect = exploded.columns.filterNot(_ == "elements").map(col)
// Prepend the struct expansion, then splat the whole sequence into select
val flattened = exploded.select(($"elements.*" +: colsToSelect): _*)
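The same result can also be reached with plain string column names, since `select` has a varargs `String` overload. A sketch under the same assumptions as the answer above:

```scala
// Keep every column name except "elements", then splice in the struct's fields
val rest = exploded.columns.filter(_ != "elements")
val flattened = exploded.select("elements.*", rest: _*)
```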