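According to the accepted answer in "pyspark collect_set or collect_list with groupby", when you call collect_list on a column, the null values in that column are removed. I have checked, and this is true.
But in my case I need to keep the null values. How can I achieve this?
I did not find any information about such a variant of the collect_list function.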
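As a rough illustration of the behaviour described above, here is a plain-Python model (not Spark code) of how collect_list treats nulls. The name collect_list_model is hypothetical and exists only for this sketch.

```python
# Plain-Python model of the null handling described above; this is not Spark code.
def collect_list_model(values):
    """Gather values in order, skipping None, the way collect_list skips nulls."""
    return [v for v in values if v is not None]


cities = ["Paris", None, "Seoul"]
collected = collect_list_model(cities)
print(collected)  # ['Paris', 'Seoul']
```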
Background context explaining why I want the nulls:
I have a dataframe df that looks like this:
cId | eId | amount | city
1 | 2 | 20.0 | Paris
1 | 2 | 30.0 | Seoul
1 | 3 | 10.0 | Phoenix
1 | 3 | 5.0 | null
I want to write it to an Elasticsearch index with the following mapping:
"mappings": {
    "doc": {
        "properties": {
            "eId": { "type": "keyword" },
            "cId": { "type": "keyword" },
            "transactions": {
                "type": "nested",
                "properties": {
                    "amount": { "type": "keyword" },
                    "city": { "type": "keyword" }
                }
            }
        }
    }
}
To conform to the nested mapping above, I transformed my df so that for each combination of eId and cId I have an array of transactions, like this:
from pyspark.sql.functions import collect_list, struct
df_nested = df.groupBy('eId','cId').agg(
    collect_list(struct('amount','city')).alias("transactions"))
df_nested.printSchema()
root
 |-- cId: integer (nullable = true)
 |-- eId: integer (nullable = true)
 |-- transactions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- amount: float (nullable = true)
 |    |    |-- city: string (nullable = true)
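Saving df_nested as a JSON file, I get these JSON records:
{"cId":1,"eId":2,"transactions":[{"amount":20.0,"city":"Paris"},{"amount":30.0,"city":"Seoul"}]}
{"cId":1,"eId":3,"transactions":[{"amount":10.0,"city":"Phoenix"},{"amount":5.0}]}
As you can see, when cId=1 and eId=3, the array element with amount=5.0 has no city attribute, because it was null in my original data (df). The nulls are removed when I use the collect_list function.
However, when I try to write df_nested to Elasticsearch with the index above, it errors out because of a schema mismatch. This is basically why I want to keep the nulls after applying the collect_list function.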
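Looking at the records above, the array for eId=3 still has two elements; only the city key is missing from the second one. One way to picture this, as a plain-Python sketch rather than Spark code: a struct with a null field is itself non-null, so it is still collected, and the null field appears to be left out only when the record is serialized to JSON. The helper names below are hypothetical, and restoring the key afterwards is just one possible workaround, assuming the set of expected fields is known in advance.

```python
import json


def drop_null_fields(record):
    """Model a JSON writer that omits fields whose value is null (None)."""
    return {key: value for key, value in record.items() if value is not None}


def restore_null_fields(record, field_names):
    """Re-add every expected field, using None for any that were omitted."""
    return {name: record.get(name) for name in field_names}


# The two transactions for cId=1, eId=3 from the dataframe in the question.
collected = [{"amount": 10.0, "city": "Phoenix"}, {"amount": 5.0, "city": None}]

written = [drop_null_fields(t) for t in collected]
print(json.dumps(written))
# [{"amount": 10.0, "city": "Phoenix"}, {"amount": 5.0}]

restored = [restore_null_fields(t, ["amount", "city"]) for t in written]
print(json.dumps(restored))
# [{"amount": 10.0, "city": "Phoenix"}, {"amount": 5.0, "city": null}]
```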
Answer 0 (score: 2)
This should give you what you need:
from pyspark.sql.functions import create_map, collect_list, lit, col, to_json

df = spark.createDataFrame([[1, 2, 20.0, "Paris"], [1, 2, 30.0, "Seoul"],
                            [1, 3, 10.0, "Phoenix"], [1, 3, 5.0, None]],
                           ["cId", "eId", "amount", "city"])

# Build a map per row (a null city stays in the map as a null value),
# then collect the maps for each (eId, cId) group.
df_nested = df.withColumn(
        "transactions",
        create_map(lit("city"), col("city"), lit("amount"), col("amount")))\
    .groupBy("eId", "cId")\
    .agg(collect_list("transactions").alias("transactions"))
That gives me:
+---+---+------------------------------------------------------------------+
|eId|cId|transactions |
+---+---+------------------------------------------------------------------+
|2 |1 |[[city -> Paris, amount -> 20.0], [city -> Seoul, amount -> 30.0]]|
|3 |1 |[[city -> Phoenix, amount -> 10.0], [city ->, amount -> 5.0]] |
+---+---+------------------------------------------------------------------+
The JSON for the column you are interested in then looks the way you want:
>>> for row in df_nested.select(to_json("transactions").alias("json")).collect():
...     print(row["json"])
...
[{"city":"Paris","amount":"20.0"},{"city":"Seoul","amount":"30.0"}]
[{"city":"Phoenix","amount":"10.0"},{"city":null,"amount":"5.0"}]
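Note that in the JSON above the amounts come out as quoted strings, presumably because all values in a map share one type. If numeric amounts are needed downstream, one option is to convert them back after parsing. A minimal plain-Python sketch, using the hypothetical helper parse_transactions and assuming the JSON text has the shape printed above:

```python
import json


def parse_transactions(json_text):
    """Parse one line of the to_json output, casting amounts back to float."""
    transactions = json.loads(json_text)
    for transaction in transactions:
        # Leave a null amount alone; only cast values that are present.
        if transaction["amount"] is not None:
            transaction["amount"] = float(transaction["amount"])
    return transactions


line = '[{"city":"Phoenix","amount":"10.0"},{"city":null,"amount":"5.0"}]'
parsed = parse_transactions(line)
print(parsed)
# [{'city': 'Phoenix', 'amount': 10.0}, {'city': None, 'amount': 5.0}]
```

Since the mapping in the question declares amount as keyword, the string form may already be acceptable for indexing, in which case this step can be skipped.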