Parsing a nested JSON file into a tabular DataFrame with PySpark (1 row per episode)?

Asked: 2019-02-27 20:37:51

Tags: json pyspark pyspark-sql

The data can be found here; it is a relatively small JSON file I found on GitHub. I am trying to learn the best way to parse it into a DataFrame I can analyze (1 row per episode).

http://api.tvmaze.com/singlesearch/shows?q=black-mirror&embed=episodes

After reading the file with spark.read.json(), the DataFrame has the schema shown below:

# File location and type
file_location = "/FileStore/tables/blackmirror.json"
file_type = "json"

# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

df.printSchema()

 |-- _embedded: struct (nullable = true)
 |    |-- episodes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _links: struct (nullable = true)
 |    |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- airdate: string (nullable = true)
 |    |    |    |-- airstamp: string (nullable = true)
 |    |    |    |-- airtime: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- image: struct (nullable = true)
 |    |    |    |    |-- medium: string (nullable = true)
 |    |    |    |    |-- original: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- number: long (nullable = true)
 |    |    |    |-- runtime: long (nullable = true)
 |    |    |    |-- season: long (nullable = true)
 |    |    |    |-- summary: string (nullable = true)
 |    |    |    |-- url: string (nullable = true)
 |-- _links: struct (nullable = true)
 |    |-- previousepisode: struct (nullable = true)
 |    |    |-- href: string (nullable = true)
 |    |-- self: struct (nullable = true)
 |    |    |-- href: string (nullable = true)
 |-- externals: struct (nullable = true)
 |    |-- imdb: string (nullable = true)
 |    |-- thetvdb: long (nullable = true)
 |    |-- tvrage: long (nullable = true)
 |-- genres: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- id: long (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- medium: string (nullable = true)
 |    |-- original: string (nullable = true)
 |-- language: string (nullable = true)
 |-- name: string (nullable = true)
 |-- network: string (nullable = true)
 |-- officialSite: string (nullable = true)
 |-- premiered: string (nullable = true)
 |-- rating: struct (nullable = true)
 |    |-- average: double (nullable = true)
 |-- runtime: long (nullable = true)
 |-- schedule: struct (nullable = true)
 |    |-- days: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- time: string (nullable = true)
 |-- status: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- type: string (nullable = true)
 |-- updated: long (nullable = true)
 |-- url: string (nullable = true)
 |-- webChannel: struct (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- name: string (nullable = true)
 |-- weight: long (nullable = true)

When I call df.count(), a single row is returned. I want 1 row per episode. Looking at some similar answers, I think I can use sql.functions explode() to turn the array into rows, but I would like to know the best way to do this without losing information.

0 Answers