I am trying to apply a user-defined aggregate function to a Spark DataFrame in order to apply additive smoothing; see the code below:
import findspark
findspark.init()
import pyspark as ps
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, collect_list, concat_ws, udf

# Reuse an existing SparkContext if one is already defined (e.g. in a notebook).
try:
    sc
except NameError:
    sc = ps.SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([['A', 1],
                                 ['A', 1],
                                 ['A', 0],
                                 ['B', 0],
                                 ['B', 0],
                                 ['B', 1]], schema=['name', 'val'])

def smooth_mean(x):
    # Additive smoothing: add 5 to both the sum and the count.
    return (sum(x) + 5) / (len(x) + 5)

smooth_mean_udf = udf(smooth_mean)

# Collect all values per name into a list, then apply the udf to each list.
df.groupBy('name').agg(collect_list('val').alias('val'))\
    .withColumn('val', smooth_mean_udf('val')).show()
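Does it make sense to do it this way? From what I understand, this will not scale well because I am using a udf. I also cannot find out exactly how collect_list works; the "collect" part of the name seems to indicate that the data is "collected" to the edge node, but I assume the data is "collected" to various nodes?
Thanks in advance for any feedback.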
Answer 0 (score: 5)
"From what I understand, this will not scale"
Your understanding is correct. The biggest problem here is collect_list, which is just good old groupByKey. The Python udf has a much smaller impact, but using it for simple arithmetic operations makes no sense.
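To illustrate why the grouping step is the expensive part, here is a small plain-Python sketch. It is not Spark code: the two-partition layout and the function names are made up for illustration. The idea is that collecting values per key means every single value has to reach the place where its group is assembled, whereas a sum and a count can be reduced to one small pair per key on each partition before anything is merged.

```python
from collections import defaultdict

# Two hypothetical partitions holding the (name, val) rows from the question.
partitions = [
    [("A", 1), ("A", 1), ("B", 0)],
    [("A", 0), ("B", 0), ("B", 1)],
]


def partial_sum_count(rows):
    """Reduce one partition to a single (sum, count) pair per key."""
    acc = defaultdict(lambda: (0, 0))
    for key, val in rows:
        s, n = acc[key]
        acc[key] = (s + val, n + 1)
    return dict(acc)


def merge(partials):
    """Combine the per-partition (sum, count) pairs into totals per key."""
    total = defaultdict(lambda: (0, 0))
    for part in partials:
        for key, (s, n) in part.items():
            ts, tn = total[key]
            total[key] = (ts + s, tn + n)
    return dict(total)


# Only one small pair per key and partition is merged, never the raw values.
totals = merge(partial_sum_count(p) for p in partitions)

# Same smoothing as in the question: add 5 to both the sum and the count.
smoothed = {k: (s + 5) / (n + 5) for k, (s, n) in totals.items()}
print(smoothed)
```

The smoothed mean only needs the sum and the count of each group, so the full list of values never has to be materialized.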
Just use standard aggregations instead of a udf.
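A minimal sketch of what that could look like, assuming the df from the question (columns name and val) and the same smoothing constant of 5. The helper names smoothed_mean and smooth_mean_by_name are hypothetical and exist only for this example; sum and count are the built-in aggregates from pyspark.sql.functions.

```python
def smoothed_mean(total, n, alpha=5):
    """Plain-Python reference for the formula: (sum + alpha) / (count + alpha)."""
    return (total + alpha) / (n + alpha)


def smooth_mean_by_name(df, alpha=5):
    """Smoothed mean of `val` per `name`, using only built-in Spark aggregates.

    Hypothetical helper; assumes `df` has the columns `name` and `val`
    as in the question.
    """
    # Imported here so the reference function above can be used without Spark.
    from pyspark.sql import functions as F

    # No collect_list and no udf: the arithmetic is applied directly to the
    # aggregated sum and count columns.
    return df.groupBy("name").agg(
        ((F.sum("val") + alpha) / (F.count("val") + alpha)).alias("val")
    )
```

With the DataFrame from the question, smooth_mean_by_name(df).show() would replace the collect_list and udf pipeline. As a sanity check on the arithmetic, group A has sum 2 and count 3, and smoothed_mean(2, 3) gives 0.875 in Python 3.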