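Alternative ways to apply a user-defined aggregate function in pyspark

Time: 2018-01-29 12:04:08

Tags: python apache-spark pyspark user-defined-functions

I am trying to apply a user-defined aggregate function to a Spark DataFrame in order to apply additive smoothing; see the code below: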

import findspark
findspark.init()
import pyspark as ps
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, collect_list, concat_ws, udf

try:
    sc
except NameError:
    sc = ps.SparkContext()
    sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([['A', 1],
                                 ['A', 1],
                                 ['A', 0],
                                 ['B', 0],
                                 ['B', 0],
                                 ['B', 1]], schema=['name', 'val'])


def smooth_mean(x):
    return (sum(x)+5)/(len(x)+5)

smooth_mean_udf = udf(smooth_mean)

df.groupBy('name').agg(collect_list('val').alias('val'))\
.withColumn('val', smooth_mean_udf('val')).show()

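Does it make sense to do it this way? As I understand it, this does not scale well because I am using a udf. I also can't find out exactly how collect_list works: the "collect" part of the name seems to suggest that the data is "collected" to the edge node, but I assume the data is "collected" to the various worker nodes?

Thanks in advance for any feedback.

1 answer:

Answer 0: (score: 5)

As I understand it, this does not scale well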

Your understanding is correct, and the biggest problem here is collect_list, which is just good old groupByKey. A Python udf has a much smaller impact, but it doesn't make sense to use one for simple arithmetic operations.

Just use standard aggregations.
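The answer's own code did not survive in this copy, so the following is only a minimal sketch of what a standard aggregation for the question's additive smoothing could look like. It assumes the question's `df` with columns `name` and `val`. The names `smoothed_means`, `smooth_mean_reference` and `pseudo_count` are hypothetical and introduced here for illustration; `sum`, `count`, `groupBy` and `agg` are the built-in PySpark calls.

```python
def smooth_mean_reference(values, pseudo_count=5):
    """Plain-Python version of the question's smooth_mean, useful for checking results."""
    return (sum(values) + pseudo_count) / (len(values) + pseudo_count)


def smoothed_means(df, pseudo_count=5):
    """Additively smoothed mean of `val` per `name`, using built-in aggregates only.

    No collect_list and no Python udf: Spark can compute partial sums and
    counts on each partition and only shuffle those small intermediate results.
    """
    # Imported here (an illustrative choice) so that the reference helper
    # above can be used without a Spark installation.
    from pyspark.sql import functions as F

    smoothed = (F.sum("val") + pseudo_count) / (F.count("val") + pseudo_count)
    return df.groupBy("name").agg(smoothed.alias("val"))
```

With the `df` from the question, `smoothed_means(df).show()` should give 0.875 for `A` and 0.75 for `B`, matching `smooth_mean_reference([1, 1, 0])` and `smooth_mean_reference([0, 0, 1])`.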