I have a file with the following contents:
a {"field1":{"field2":"val","field3":"val"...}}
b {"field1":{"field2":"val","field3":"val"...}}
...
I can load the file into a table like this:
╔════╦═══════════════════════════════════════════════╗
║ ID ║ JSON                                          ║
╠════╬═══════════════════════════════════════════════╣
║ a  ║ {"field1":{"field2":"val","field3":"val"...}} ║
║ b  ║ {"field1":{"field2":"val","field3":"val"...}} ║
╚════╩═══════════════════════════════════════════════╝
How can I turn it into something like this?
╔════╦════════╦════════╦═════╦═════╗
║ ID ║ field2 ║ field3 ║ ... ║ ... ║
╠════╬════════╬════════╬═════╬═════╣
║ a  ║ val    ║ val    ║ ... ║ ... ║
║ b  ║ val    ║ val    ║ ... ║ ... ║
╚════╩════════╩════════╩═════╩═════╝
Since only part of each line is JSON, I can't use read.json on the file directly.
I have also looked at the post convert lines of json in RDD to dataframe in apache Spark,
but my JSON string is nested and quite long, so I don't want to list out all of its fields.
I also tried:
#solr_data is the data frame made from the file, and json is the column with the json string, session is a SparkSession
json_table = solr_data.select(solr_data["json"]).rdd.map(lambda x:session.read.json(x))
This doesn't really work: I can't call show() or collect() on the result, and createDataFrame() doesn't work either.
Answer 0 (score: 0)
Use select("JSON.field1.*") to "destructure" the nested sub-JSON into columns. Note that this requires the JSON column to be a struct rather than a plain string, so parse it first.