Question

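How do I define a schema for a JSON array so that it can be exploded into rows?

I have a UDF that returns a string (a JSON array), and I want to explode the items in the array into rows and then save them.

Sample JSON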

[{"name":"name1","id":"1"}, {"name":"name2","id":"2"}]

I defined the schema as

// schema of each item in the array
StructType arrayItemDataType = new StructType(new[]
{
    new StructField("name", new StringType()),
    new StructField("id", new StringType())
});
var rootType = new ArrayType(arrayItemDataType); // array of items

var field = new StructField(name: "MyJson", dataType: rootType, isNullable: false);
StructType schema = new StructType(new[] { field });


df.select(from_json('mycol', schema))

After the explode, it creates a schema like

root
 |-- col: struct (nullable = true)
 |    |-- name: string
 |    |-- id: string

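I am running on a local cluster and writing to CSV. I expected the DataFrame to have two columns after the explode, name and id, so that I could write all rows to CSV. When I run it, the DataFrame schema does not come out as name, id, and writing to CSV fails with the message "csv does not support struct<".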

Answer 1

Sample DataFrame:

%python

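# one row, one string column holding the JSON array
# (named "data" to avoid shadowing the built-in list)
data=[['[{"name":"name1","id":"1"}, {"name":"name2","id":"2"}]']]
df=spark.createDataFrame(data,['mycol'])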
df.show(truncate=False)

#df:pyspark.sql.dataframe.DataFrame = [mycol: string]
#+------------------------------------------------------+
#|mycol                                                 |
#+------------------------------------------------------+
#|[{"name":"name1","id":"1"}, {"name":"name2","id":"2"}]|
#+------------------------------------------------------+

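Parse the string with an ArrayType(StructType(...)) schema, explode the array, and then select all fields of the resulting struct with "<column name>.*":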

from pyspark.sql.types import *
from pyspark.sql import functions as F
json_schema=ArrayType(StructType([ StructField("name", StringType()), StructField("id", StringType())]))

df.withColumn("json",F.explode(F.from_json("mycol",json_schema)))\
  .select("json.*").show()

#+-----+---+
#| name| id|
#+-----+---+
#|name1|  1|
#|name2|  2|
#+-----+---+
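
To make the two steps concrete without a Spark session, here is a minimal plain-Python sketch of what from_json + explode + select("json.*") do to each input string, followed by the CSV write the question asks about. The helper name explode_json_rows and the use of the standard csv module are assumptions for illustration only; they are not part of Spark's API.

```python
import csv
import io
import json


def explode_json_rows(json_strings, fields=("name", "id")):
    """Yield one flat dict per element of each JSON array string.

    Hypothetical helper that mimics from_json + explode + select("json.*").
    """
    for raw in json_strings:
        for item in json.loads(raw):  # one output row per array element
            # Keep only the declared fields; a missing field becomes None,
            # similar to the null that from_json gives for absent fields.
            yield {field: item.get(field) for field in fields}


mycol = ['[{"name":"name1","id":"1"}, {"name":"name2","id":"2"}]']
rows = list(explode_json_rows(mycol))

# The rows now hold only scalar columns, so they can be written as CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "id"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The point of the sketch is that CSV can only hold flat, scalar columns, which is why the struct column has to be expanded into name and id first. With the flattened DataFrame from the answer, the corresponding Spark step would be along the lines of .write.csv(path, header=True), where path is a placeholder for your output location.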
