Importing a JSON file into a PySpark DataFrame

Date: 2021-05-17 11:08:41

Tags: apache-spark pyspark apache-spark-sql jupyter-notebook jupyter

I have downloaded a JSON file and I am trying to load it into a DataFrame so I can do some analysis.

raw_constructors = spark.read.json("/constructors.json")

When I run raw_constructors.show(), I get only one column and one row.

+--------------------+
|              MRData|
+--------------------+
|{{[{adams, Adams,...|
+--------------------+

So when I ask for the schema of the JSON file with raw_constructors.printSchema(), I get:

root
 |-- MRData: struct (nullable = true)
 |    |-- ConstructorTable: struct (nullable = true)
 |    |    |-- Constructors: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- constructorId: string (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- nationality: string (nullable = true)
 |    |    |    |    |-- url: string (nullable = true)
 |    |-- limit: string (nullable = true)
 |    |-- offset: string (nullable = true)
 |    |-- series: string (nullable = true)
 |    |-- total: string (nullable = true)
 |    |-- url: string (nullable = true)
 |    |-- xmlns: string (nullable = true)

I am using PySpark.

How can I get a DataFrame with the four columns constructorId, name, nationality, and url, and one row per item?

Thanks!

1 answer:

Answer 0 (score: 0)

You can simply use explode to expand the array into one row per element:

from pyspark.sql import functions as F

(raw_constructors
    .select(F.explode('MRData.ConstructorTable.Constructors').alias('tmp'))
    .select('tmp.*')
    .show()
)

+-------------+----+-----------+---+
|constructorId|name|nationality|url|
+-------------+----+-----------+---+
|           i1|  n1|         y1| u1|
|           i2|  n2|         y2| u2|
+-------------+----+-----------+---+