I have a newline-delimited JSON file that looks like:
{"id":1,"nested_col": {"key1": "val1", "key2": "val2", "key3": ["arr1", "arr2"]}}
{"id":2,"nested_col": {"key1": "val1_2", "key2": "val2_2", "key3": ["arr1_2", "arr2"]}}
Once I read the file with df = spark.read.json(path_to_file), I end up with a DataFrame with the following schema:
DataFrame[id: bigint, nested_col: struct<key1:string,key2:string,key3:array<string>>]
What I'd like to do is convert nested_col so that it becomes a plain string, without setting primitivesAsString to true (since I actually have 100+ columns and need the types of all my other columns to be inferred). I also don't know ahead of time what nested_col looks like. In other words, I'd like my DataFrame to look like:
DataFrame[id: bigint, nested_col: string]
I tried doing:
df.select(df['nested_col'].cast('string')).take(1)
but it didn't return the correct string representation of the JSON:
[Row(nested_col=u'[0,2000000004,2800000004,3000000014,316c6176,326c6176,c00000002,3172726100000010,32727261]')]
whereas I was hoping for:
[Row(nested_col=u'{"key1": "val1", "key2": "val2", "key3": ["arr1", "arr2"]}')]
Does anyone know how I can get the desired result (i.e., convert a nested JSON field / StructType to a string)?
Answer 0 (score: 4)
Honestly, parsing JSON and inferring a schema only to push everything back into JSON sounds a bit odd, but here you go:
Required imports:
from pyspark.sql import types
from pyspark.sql.functions import get_json_object, struct, to_json

Helper function:
def jsonify(df):
    def convert(f):
        # Struct columns can be serialized to a JSON string directly.
        if isinstance(f.dataType, types.StructType):
            return to_json(f.name).alias(f.name)
        # In older Spark versions to_json accepted only struct columns,
        # so an array is wrapped in a one-field struct, serialized, and
        # the wrapper is then stripped back off with get_json_object.
        if isinstance(f.dataType, types.ArrayType):
            return get_json_object(
                to_json(struct(f.name)),
                "$.{0}".format(f.name)
            ).alias(f.name)
        # All other columns pass through unchanged.
        return f.name
    return df.select([convert(f) for f in df.schema.fields])
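
If, as in the question, there is a single column already known to be a struct, the helper isn't strictly needed; to_json can be applied to it directly. A minimal sketch, assuming Spark 2.1+ where to_json is available:

df.withColumn('nested_col', to_json(df['nested_col']))

The helper above simply generalizes this over every struct or array column in the schema.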
Example usage:

df = sc.parallelize([("a", 1, (2, 3), ["1", "2", "3"])]).toDF()
jsonify(df).show()
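
Applied to the question's file, this should produce the desired schema. A minimal sketch, assuming the same spark session and path_to_file as in the question (the exact Row repr may vary slightly across Spark versions):

df = spark.read.json(path_to_file)   # id and the 100+ other columns keep their inferred types
jsonify(df)
# DataFrame[id: bigint, nested_col: string]
jsonify(df).take(1)
# [Row(id=1, nested_col=u'{"key1":"val1","key2":"val2","key3":["arr1","arr2"]}')]

One caveat: to_json emits compact JSON, so unlike the expected output shown in the question there are no spaces after the colons and commas; the content is otherwise equivalent.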