How do I flatten an array inside nested JSON in AWS Glue using PySpark?

Date: 2019-10-04 15:51:11

Tags: arrays json pyspark pyspark-sql aws-glue

I'm trying to flatten a JSON file so that all of it can be loaded into PostgreSQL from AWS Glue. I'm using PySpark. A crawler crawls the S3 JSON and produces a table. I then use a Glue ETL script to do the following:

  • Read the crawled table
  • Flatten the file using the Relationalize transform
  • Convert the dynamic frame to a data frame
  • Try to explode the request.data field

The script so far:

from awsglue.transforms import Relationalize
from pyspark.sql.functions import explode, col

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")

df0 = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")

df1 = df0.select(dfc_root_table_name)

df2 = df1.toDF()

df2 = df2.select(explode(col('`request.data`')).alias("request_data"))

<then I write df1 to a PostgreSQL database, which works fine>

The problems I'm running into:

The Relationalize transform works fine, but the request.data field becomes a bigint, so explode doesn't work.

Because of the structure of the data, I can't explode without running Relationalize on the JSON first. The specific error is: "org.apache.spark.sql.AnalysisException: cannot resolve 'explode(request.data)' due to data type mismatch: input to function explode should be array or map type, not bigint"
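A quick way to see the mismatch is to print the schema of df2 right after the toDF() call, before attempting the explode:

df2.printSchema()   # request.data is reported as bigint here, not as an array type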

If I try to make the dynamic frame a data frame first, I run into this problem instead: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.jdbc. : java.lang.IllegalArgumentException: Can't get JDBC type for struct..."

I also tried uploading a classifier so the data would be flattened during the crawl itself, but AWS confirmed that this doesn't work.

The JSON format of the original file I'm trying to normalize looks like this:

- field1
- field2
- {}
  - field3
  - {}
    - field4
    - field5
  - []
    - {}
      - field6
      - {}
        - field7
        - field8
        - {}
          - field9
          - {}
            - field10

2 answers:

Answer 0: (score: 0)

Once you have relationalized the JSON column, you don't need to explode it. Relationalize transforms the nested JSON into key/value pairs at the outermost level of the JSON document. The transformed data keeps the chain of original keys from the nested JSON, separated by periods.

Example:

Nested JSON:

{
    "player": {
        "username": "user1",
        "characteristics": {
            "race": "Human",
            "class": "Warlock",
            "subclass": "Dawnblade",
            "power": 300,
            "playercountry": "USA"
        },
        "arsenal": {
            "kinetic": {
                "name": "Sweet Business",
                "type": "Auto Rifle",
                "power": 300,
                "element": "Kinetic"
            },
            "energy": {
                "name": "MIDA Mini-Tool",
                "type": "Submachine Gun",
                "power": 300,
                "element": "Solar"
            },
            "power": {
                "name": "Play of the Game",
                "type": "Grenade Launcher",
                "power": 300,
                "element": "Arc"
            }
        },
        "armor": {
            "head": "Eye of Another World",
            "arms": "Philomath Gloves",
            "chest": "Philomath Robes",
            "leg": "Philomath Boots",
            "classitem": "Philomath Bond"
        },
        "location": {
            "map": "Titan",
            "waypoint": "The Rig"
        }
    }
}

Flattened JSON after relationalize:

{
    "player.username": "user1",
    "player.characteristics.race": "Human",
    "player.characteristics.class": "Warlock",
    "player.characteristics.subclass": "Dawnblade",
    "player.characteristics.power": 300,
    "player.characteristics.playercountry": "USA",
    "player.arsenal.kinetic.name": "Sweet Business",
    "player.arsenal.kinetic.type": "Auto Rifle",
    "player.arsenal.kinetic.power": 300,
    "player.arsenal.kinetic.element": "Kinetic",
    "player.arsenal.energy.name": "MIDA Mini-Tool",
    "player.arsenal.energy.type": "Submachine Gun",
    "player.arsenal.energy.power": 300,
    "player.arsenal.energy.element": "Solar",
    "player.arsenal.power.name": "Play of the Game",
    "player.arsenal.power.type": "Grenade Launcher",
    "player.arsenal.power.power": 300,
    "player.arsenal.power.element": "Arc",
    "player.armor.head": "Eye of Another World",
    "player.armor.arms": "Philomath Gloves",
    "player.armor.chest": "Philomath Robes",
    "player.armor.leg": "Philomath Boots",
    "player.armor.classitem": "Philomath Bond",
    "player.location.map": "Titan",
    "player.location.waypoint": "The Rig"
}

So in your case, request.data is already a new column flattened out from the request column, and its type is interpreted by Spark as bigint.
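For reference, here is a minimal sketch of how you could inspect what Relationalize produced, reusing datasource0, glue_temp_storage, and dfc_root_table_name from your script (it only prints information and doesn't change the job):

dfc = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")

print(dfc.keys())                          # names of the dynamic frames the transform produced
root = dfc.select(dfc_root_table_name)     # pick the root frame out of the collection
root.toDF().printSchema()                  # request.data shows up here as a flattened bigint column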

Reference: Simplify/querying nested json with the aws glue relationalize transform

Answer 1: (score: 0)

from pyspark.sql import functions as F

# Recursively flatten a nested DataFrame:
# explode array columns into rows, then pull struct fields up into columns.
def flatten_df(nested_df):
    # Explode every array column (one row per array element).
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    for col in array_cols:
        nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col]))

    # Collect the struct columns; if there are none, we're done.
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df

    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']

    # Replace each struct column with its fields, renamed as <struct>_<field>.
    flat_df = nested_df.select(flat_cols +
                            [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])

    # Recurse in case exploding or flattening exposed further nesting.
    return flatten_df(flat_df)

df = flatten_df(df)

It replaces all dots with underscores. Note that it uses explode_outer rather than explode so that rows are kept even when the array itself is null. This function is only available in Spark v2.4+.
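A quick standalone demo of that difference, using made-up data:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sample = spark.createDataFrame([("a", [1, 2]), ("b", None)], ["id", "values"])

sample.select("id", F.explode("values")).show()        # the "b" row (null array) is dropped
sample.select("id", F.explode_outer("values")).show()  # the "b" row is kept, with a null value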

Also keep in mind that exploding arrays adds duplicate rows, so the total row count grows, while flattening structs adds more columns. In short, the original df blows up both horizontally and vertically, which may slow down data processing later on.

So my suggestion is to identify the data that is actually relevant for your features, store only that data in PostgreSQL, and keep the original raw JSON files in S3.
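A rough sketch of that split (the column names, table name, and connection details below are placeholders, and it assumes the PostgreSQL JDBC driver is available to the Glue job):

feature_cols = ["field1", "field2"]               # placeholder: only the columns you actually need
feature_df = flatten_df(df).select(feature_cols)

feature_df.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://<host>:5432/<database>") \
    .option("dbtable", "features") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .mode("append") \
    .save()

# The original raw JSON files stay untouched in S3 as the source of truth.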