Pyspark - preserving null values when using collect_list

Time: 2018-03-20 22:54:27

Tags: nested pyspark-sql collect elasticsearch-hadoop elasticsearch-mapping

According to the accepted answer in pyspark collect_set or collect_list with groupby, when you perform a collect_list on a column, the null values in that column are dropped. I have checked, and this is true.

But in my case I need to keep the null entries - how can I achieve this?

I have not found any information about such a variant of the collect_list function.

Background context explaining why I want the null values:

I have a dataframe df as follows:

cId   |  eId  |  amount  |  city
1     |  2    |   20.0   |  Paris
1     |  2    |   30.0   |  Seoul
1     |  3    |   10.0   |  Phoenix
1     |  3    |   5.0    |  null

I want to write it to an Elasticsearch index with the following mapping:

"mappings": {
    "doc": {
        "properties": {
            "eId": { "type": "keyword" },
            "cId": { "type": "keyword" },
            "transactions": {
                "type": "nested", 
                "properties": {
                    "amount": { "type": "keyword" },
                    "city": { "type": "keyword" }
                }
            }
        }
    }
 }      

To conform to the nested mapping above, I transformed my df so that for each combination of eId and cId I have an array of transactions like this:

from pyspark.sql.functions import collect_list, struct

df_nested = df.groupBy('eId','cId').agg(collect_list(struct('amount','city')).alias("transactions"))
df_nested.printSchema()
root
 |-- cId: integer (nullable = true)
 |-- eId: integer (nullable = true)
 |-- transactions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- amount: float (nullable = true)
 |    |    |-- city: string (nullable = true)

Saving df_nested as a JSON file, I get the following JSON records:

{"cId":1,"eId":2,"transactions":[{"amount":20.0,"city":"Paris"},{"amount":30.0,"city":"Seoul"}]}
{"cId":1,"eId":3,"transactions":[{"amount":10.0,"city":"Phoenix"},{"amount":5.0}]}

As you can see, when cId=1 and eId=3, one of my array elements (the one with amount=5.0) has no city property, because city was null in my original data (df). The null values are dropped when I use the collect_list function.

However, when I try to write df_nested to Elasticsearch using the index above, it errors out because of a schema mismatch. That is essentially why I want to keep the null values after applying the collect_list function.

1 Answer:

Answer 0 (score: 2)

This should do what you need:

from pyspark.sql.functions import create_map, collect_list, lit, col, to_json

df = spark.createDataFrame([[1, 2, 20.0, "Paris"], [1, 2, 30.0, "Seoul"], 
    [1, 3, 10.0, "Phoenix"], [1, 3, 5.0, None]], 
    ["cId", "eId", "amount", "city"])

df_nested = df.withColumn(
        "transactions", 
         create_map(lit("city"), col("city"), lit("amount"), col("amount")))\
    .groupBy("eId","cId")\
    .agg(collect_list("transactions").alias("transactions"))

That gives me

+---+---+------------------------------------------------------------------+
|eId|cId|transactions                                                      |
+---+---+------------------------------------------------------------------+
|2  |1  |[[city -> Paris, amount -> 20.0], [city -> Seoul, amount -> 30.0]]|
|3  |1  |[[city -> Phoenix, amount -> 10.0], [city ->, amount -> 5.0]]     |
+---+---+------------------------------------------------------------------+

and then the JSON of the column you are interested in looks the way you want:

>>> for row in df_nested.select(to_json("transactions").alias("json")).collect():
...     print(row["json"])

[{"city":"Paris","amount":"20.0"},{"city":"Seoul","amount":"30.0"}]
[{"city":"Phoenix","amount":"10.0"},{"city":null,"amount":"5.0"}]