Question

I have been trying to update a nested field value in a PySpark DataFrame. I followed the answer given in "How to update a value in the nested column of struct using pyspark", but it has not gotten me as far as I need.

JSON data:
{
  "documentKey": {
    "_id": "1234567"
  },
  "fullDocument": {
    "did": "1fcee68a43c500e0",
    "sg": {
      "media_ended_timestamp": 1626940125,
      "media_id": 56010
    },
    "ts": "ts"
  }
}

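Now, suppose I want to update the field fullDocument.sg.media_id from 56010 to 11111. What are the possible ways to do this?

Note: Using the answer from the link above, I was able to update fullDocument.did successfully.

Spark: 3.1.1, Python: 3.9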

Answer 1

I was able to do this with the code below:

from pyspark.sql import functions as F

# Flatten fullDocument and sg into top-level columns, overwrite media_id,
# rebuild both structs from the flattened columns, then drop the temporaries.
df = df.select('*', 'fullDocument.*') \
    .select('*', 'sg.*') \
    .withColumn('media_id', F.lit(11111)) \
    .withColumn('sg', F.struct(*[F.col(c) for c in df.select('fullDocument.sg.*').columns])) \
    .withColumn('fullDocument', F.struct(*[F.col(c) for c in df.select('fullDocument.*').columns])) \
    .drop(*df.select('fullDocument.*').columns) \
    .drop(*df.select('fullDocument.sg.*').columns)
