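I have a JSON file that contains nested JSON, and I want to read the nested parts with PySpark's explode function. I'm new to this, so I tried to use explode without creating a DataFrame, but I couldn't get the syntax right. How do I use explode? Is calling it directly the right approach, or do I have to create a DataFrame first and only then use explode? I've read a few answers on Stack Overflow but couldn't find what I needed. I'd appreciate a simple explanation. Thanks in advance.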
Answer 0 (score: -1)
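You can do it with the code below. explode operates on a DataFrame column, so read the JSON into a DataFrame first, then explode each nested array in turn: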
source_json = """
{
  "persons": [
    {
      "name": "John",
      "age": 30,
      "cars": [
        {
          "name": "Ford",
          "models": [
            "Fiesta",
            "Focus",
            "Mustang"
          ]
        },
        {
          "name": "BMW",
          "models": [
            "320",
            "X3",
            "X5"
          ]
        }
      ]
    },
    {
      "name": "Peter",
      "age": 46,
      "cars": [
        {
          "name": "Huyndai",
          "models": [
            "i10",
            "i30"
          ]
        },
        {
          "name": "Mercedes",
          "models": [
            "E320",
            "E63 AMG"
          ]
        }
      ]
    }
  ]
}
"""
from pyspark.sql.functions import explode, col

# Write the sample JSON to a file (dbutils is Databricks-specific), then read it as a DataFrame
dbutils.fs.put("/tmp/source.json", source_json, True)
source_df = spark.read.option("multiline", "true").json("/tmp/source.json")
# Step 1: one row per person
persons = source_df.select(explode("persons").alias("persons"))
# Step 2: one row per (person, car); the exploded struct is read in the next select, not this one
persons_cars = persons.select(col("persons.name").alias("persons_name"), col("persons.age").alias("persons_age"), explode("persons.cars").alias("persons_cars_brands"))
# Step 3: one row per (person, car brand, model)
persons_cars_models = persons_cars.select(col("persons_name"), col("persons_age"), col("persons_cars_brands.name").alias("persons_cars_brand"), explode("persons_cars_brands.models").alias("persons_cars_model"))
persons_cars_models.show()
+------------+-----------+------------------+------------------+
|persons_name|persons_age|persons_cars_brand|persons_cars_model|
+------------+-----------+------------------+------------------+
|        John|         30|              Ford|            Fiesta|
|        John|         30|              Ford|             Focus|
|        John|         30|              Ford|           Mustang|
|        John|         30|               BMW|               320|
|        John|         30|               BMW|                X3|
|        John|         30|               BMW|                X5|
|       Peter|         46|           Huyndai|               i10|
|       Peter|         46|           Huyndai|               i30|
|       Peter|         46|          Mercedes|              E320|
|       Peter|         46|          Mercedes|           E63 AMG|
+------------+-----------+------------------+------------------+
+------------+-----------+------------------+------------------+
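
To build intuition for what the chained explode calls do, here is a minimal plain-Python sketch (no Spark involved) that flattens the same sample document with nested loops. The helper name flatten_persons is hypothetical and is not part of PySpark; it only mirrors the row-multiplying behaviour of explode, where each element of an array becomes its own row and an empty array contributes no rows.

```python
import json

# The same sample document as in the answer, written compactly.
source_json = """
{"persons": [
  {"name": "John", "age": 30, "cars": [
    {"name": "Ford", "models": ["Fiesta", "Focus", "Mustang"]},
    {"name": "BMW", "models": ["320", "X3", "X5"]}]},
  {"name": "Peter", "age": 46, "cars": [
    {"name": "Huyndai", "models": ["i10", "i30"]},
    {"name": "Mercedes", "models": ["E320", "E63 AMG"]}]}
]}
"""


def flatten_persons(doc):
    """Return one (name, age, brand, model) tuple per car model.

    Hypothetical helper for illustration: each loop level plays the
    role of one explode() call in the PySpark answer.
    """
    rows = []
    for person in doc["persons"]:        # like explode("persons")
        for car in person["cars"]:       # like explode("persons.cars")
            for model in car["models"]:  # like explode on the models array
                rows.append((person["name"], person["age"], car["name"], model))
    return rows


rows = flatten_persons(json.loads(source_json))
for row in rows:
    print(row)
```

This yields the same ten rows as the table above. In Spark the same work is expressed as column operations on a DataFrame so that it can be distributed, which is why the JSON is loaded into a DataFrame before explode is applied.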