How can I aggregate strings into a dictionary in pyspark?

Asked: 2018-01-16 15:54:31

Tags: string, aggregation

I have a dataframe that I want to aggregate to a daily level.

    data = [
        (125, '2012-10-10', 'good'),
        (20, '2012-10-10', 'good'),
        (40, '2012-10-10', 'bad'),
        (60, '2012-10-10', 'NA')]
    df = spark.createDataFrame(data, ["temperature", "date", "performance"])

I can aggregate numeric values with Spark's built-in functions like max, min, and avg. How can I aggregate strings?

What I expect is:

    date       | max_temp | min_temp | performance_frequency
    2012-10-10 | 125      | 20       | "good": 2, "bad": 1, "NA": 1

Thanks.

1 Answer:

Answer 0 (score: 1)

We can use MapType and a UDF with Counter to return the value counts:

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, IntegerType
from collections import Counter

data = [(125, '2012-10-10', 'good'), (20, '2012-10-10', 'good'),
        (40, '2012-10-10', 'bad'), (60, '2012-10-10', 'NA')]
df = spark.createDataFrame(data, ["temperature", "date", "performance"])

# UDF that turns the collected list of strings into a value -> count map
udf1 = F.udf(lambda x: dict(Counter(x)), MapType(StringType(), IntegerType()))

df.groupby('date').agg(
    F.min('temperature'),
    F.max('temperature'),
    udf1(F.collect_list('performance')).alias('performance_frequency')
).show(1, False)
+----------+----------------+----------------+---------------------------------+
|date      |min(temperature)|max(temperature)|performance_frequency            |
+----------+----------------+----------------+---------------------------------+
|2012-10-10|20              |125             |Map(NA -> 1, bad -> 1, good -> 2)|
+----------+----------------+----------------+---------------------------------+
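The UDF itself is ordinary Python: `F.collect_list('performance')` hands it a plain list of strings per group, and the lambda just wraps `collections.Counter`. Outside Spark, it behaves like this (the sample list below mirrors the data in the question):

```python
from collections import Counter

# The list collect_list('performance') would produce for 2012-10-10
performance_values = ['good', 'good', 'bad', 'NA']

# dict(Counter(...)) converts the Counter to a plain dict,
# which Spark then serializes as a MapType column
frequency = dict(Counter(performance_values))
print(frequency)  # {'good': 2, 'bad': 1, 'NA': 1}
```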

df.groupby('date').agg(
    F.min('temperature'),
    F.max('temperature'),
    udf1(F.collect_list('performance')).alias('performance_frequency')
).collect()
[Row(date='2012-10-10', min(temperature)=20, max(temperature)=125, performance_frequency={'bad': 1, 'good': 2, 'NA': 1})]

Hope this helps!