I downloaded a JSON file and I'm trying to load it into a DataFrame so I can do some analysis.
raw_constructors = spark.read.json("/constructors.json")
When I run raw_constructors.show(), I only get one column and one row:
+--------------------+
| MRData|
+--------------------+
|{{[{adams, Adams,...|
+--------------------+
And when I run raw_constructors.printSchema(), I get:
root
|-- MRData: struct (nullable = true)
| |-- ConstructorTable: struct (nullable = true)
| | |-- Constructors: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- constructorId: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- nationality: string (nullable = true)
| | | | |-- url: string (nullable = true)
| |-- limit: string (nullable = true)
| |-- offset: string (nullable = true)
| |-- series: string (nullable = true)
| |-- total: string (nullable = true)
| |-- url: string (nullable = true)
| |-- xmlns: string (nullable = true)
I'm using pyspark.
How can I get a DataFrame with 4 columns (constructorId, name, nationality, url) and one row per element of the array?
Thanks!
Answer 0 (score: 0)
You can simply use explode to expand the array into one row per element, then select the fields of the resulting struct:
from pyspark.sql import functions as F

(raw_constructors
 .select(F.explode('MRData.ConstructorTable.Constructors').alias('tmp'))
 .select('tmp.*')
 .show()
)
+-------------+----+-----------+---+
|constructorId|name|nationality|url|
+-------------+----+-----------+---+
| i1| n1| y1| u1|
| i2| n2| y2| u2|
+-------------+----+-----------+---+
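For intuition, the same flattening can be mimicked in plain Python without Spark: navigate down to the nested array and emit one record per element. This is only an illustrative sketch; the sample JSON below is hypothetical (modeled on the schema printSchema() reported), not the actual file contents.

```python
import json

# Hypothetical sample mirroring the MRData layout from printSchema()
raw = json.loads("""
{"MRData": {
  "ConstructorTable": {"Constructors": [
    {"constructorId": "adams", "name": "Adams",
     "nationality": "American", "url": "http://example.com/adams"}
  ]},
  "limit": "30", "offset": "0", "series": "f1",
  "total": "1", "url": "http://example.com",
  "xmlns": "http://ergast.com/mrd/1.4"}}
""")

# Equivalent of explode(...) + select('tmp.*'):
# descend to the array, then treat each struct as one row
rows = raw["MRData"]["ConstructorTable"]["Constructors"]
for r in rows:
    print(r["constructorId"], r["name"], r["nationality"], r["url"])
```

In Spark, explode does exactly this descent-and-flatten, but lazily and in parallel across partitions.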