The data can be found here; it is a relatively small JSON file I found on GitHub. I'm trying to learn the best way to parse it into a DataFrame I can analyze (1 row per episode).
http://api.tvmaze.com/singlesearch/shows?q=black-mirror&embed=episodes
After reading it in (via spark.read.json()), the schema of the resulting DataFrame is shown below:
# File location and type
file_location = "/FileStore/tables/blackmirror.json"
file_type = "json"
# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)
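
Since the source is JSON, the CSV-only options above (header, sep, inferSchema) are simply ignored. As a sketch, the equivalent read can be written more directly as below; the multiLine option is only needed if the file was saved as pretty-printed JSON spanning multiple lines, which depends on how the API response was stored.

# Equivalent, more direct read for a JSON source; the CSV-only options
# above have no effect on JSON input.
df = spark.read.json(file_location)

# If the file were saved pretty-printed (one record spread over many lines),
# the multiLine option would be needed instead:
# df = spark.read.option("multiLine", "true").json(file_location)
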
df.printSchema()

root
|-- _embedded: struct (nullable = true)
| |-- episodes: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- _links: struct (nullable = true)
| | | | |-- self: struct (nullable = true)
| | | | | |-- href: string (nullable = true)
| | | |-- airdate: string (nullable = true)
| | | |-- airstamp: string (nullable = true)
| | | |-- airtime: string (nullable = true)
| | | |-- id: long (nullable = true)
| | | |-- image: struct (nullable = true)
| | | | |-- medium: string (nullable = true)
| | | | |-- original: string (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- number: long (nullable = true)
| | | |-- runtime: long (nullable = true)
| | | |-- season: long (nullable = true)
| | | |-- summary: string (nullable = true)
| | | |-- url: string (nullable = true)
|-- _links: struct (nullable = true)
| |-- previousepisode: struct (nullable = true)
| | |-- href: string (nullable = true)
| |-- self: struct (nullable = true)
| | |-- href: string (nullable = true)
|-- externals: struct (nullable = true)
| |-- imdb: string (nullable = true)
| |-- thetvdb: long (nullable = true)
| |-- tvrage: long (nullable = true)
|-- genres: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: long (nullable = true)
|-- image: struct (nullable = true)
| |-- medium: string (nullable = true)
| |-- original: string (nullable = true)
|-- language: string (nullable = true)
|-- name: string (nullable = true)
|-- network: string (nullable = true)
|-- officialSite: string (nullable = true)
|-- premiered: string (nullable = true)
|-- rating: struct (nullable = true)
| |-- average: double (nullable = true)
|-- runtime: long (nullable = true)
|-- schedule: struct (nullable = true)
| |-- days: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- time: string (nullable = true)
|-- status: string (nullable = true)
|-- summary: string (nullable = true)
|-- type: string (nullable = true)
|-- updated: long (nullable = true)
|-- url: string (nullable = true)
|-- webChannel: struct (nullable = true)
| |-- country: string (nullable = true)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
|-- weight: long (nullable = true)
When I call df.count(), it returns a single row. I want 1 row per episode. Looking at some similar answers, I think I can use sql.functions explode() to turn the array into rows, but I'd like to know the best way to do this without losing information.
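
One way to do this, assuming the schema above, is to explode the _embedded.episodes array into one row per episode and then flatten the episode struct, keeping the show-level columns alongside it. A minimal sketch (the show_ prefix and the episodes_df name are just illustrative choices):

from pyspark.sql.functions import col, explode

# One row per episode: explode the embedded array into an "episode" struct column.
exploded = df.withColumn("episode", explode(col("_embedded.episodes")))

# Keep every show-level column, prefixed to avoid clashes with episode fields
# such as id, name, url, runtime, summary and image, then flatten the episode
# struct into top-level columns with "episode.*".
show_cols = [c for c in df.columns if c != "_embedded"]
episodes_df = exploded.select(
    *[col(c).alias("show_" + c) for c in show_cols],
    "episode.*",
)

episodes_df.count()        # now equals the number of episodes instead of 1
episodes_df.printSchema()

Nothing is lost this way: every show-level field stays on every episode row, and nested structs such as _links or image can be flattened further with the same dot-path selects if needed.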