How can I load a partially-JSON file into a DataFrame?

Time: 2017-07-20 15:40:19

Tags: python apache-spark pyspark apache-spark-sql

I have a file with the following content:

a {"field1":{"field2":"val","field3":"val"...}}
b {"field1":{"field2":"val","field3":"val"...}}
...

I can load the file into a table like this:

╔════╦═══════════════════════════════════════════════╗
║ ID ║ JSON                                          ║
╠════╬═══════════════════════════════════════════════╣
║  a ║ {"field1":{"field2":"val","field3":"val"...}} ║
║  b ║ {"field1":{"field2":"val","field3":"val"...}} ║
╚════╩═══════════════════════════════════════════════╝

How can I turn it into something like this?

╔════╦═════════╦════════╦════════╦════════╗
║ ID ║ field2  ║ field3 ║ ...    ║ ...    ║
╠════╬═════════╬════════╬════════╬════════╣
║  a ║ val     ║ val    ║ ..     ║ ...    ║
║  b ║ val     ║ val    ║ ..     ║ ...    ║
╚════╩═════════╩════════╩════════╩════════╝

Since each line is only partly JSON, I can't use read.json directly. I also looked at the post convert lines of json in RDD to dataframe in apache Spark, but my JSON string is deeply nested and very long, so I don't want to list all the fields by hand. I also tried:

#solr_data is the data frame made from the file, and json is the column with the json string, session is a SparkSession
json_table = solr_data.select(solr_data["json"]).rdd.map(lambda x:session.read.json(x))

This doesn't really work: I can't show() or collect() the result, and createDataFrame() doesn't fit either.

1 Answer:

Answer 0 (score: 0)

After parsing the JSON string column into a struct, use select("JSON.field1.*") to "destructure" the sub-JSON into columns.