Convert groupByKey to reduceByKey in PySpark

Time: 2016-09-20 19:45:58

Tags: apache-spark pyspark

How can I convert groupByKey to reduceByKey in PySpark? I have attached a snippet below that applies corr to each region/dept combination. I have used groupByKey, but it is very slow and fails with shuffle errors (I have 10-20 GB of data, and each group is 2-3 GB). Please help me rewrite this with reduceByKey.

Dataset

region dept week val1 val2
 US    CS   1     1    2
 US    CS   2     1.5  2
 US    CS   3     1    2
 US    ELE  1     1.1  2
 US    ELE  2     2.1  2
 US    ELE  3     1    2
 UE    CS   1     2    2

Output

region dept corr  
US      CS  0.5
US      ELE 0.6
UE      CS  0.3333

Code

import pandas as pd
from scipy.stats import pearsonr
from pyspark.sql import Row

def testFunction(key, values):
    # Build a pandas DataFrame from all rows of one (region, dept) group
    pdDF = pd.DataFrame([val.asDict() for val in values])
    # Pearson correlation between val1 and val2 for this group
    corr = pearsonr(pdDF['val1'].astype(float), pdDF['val2'].astype(float))[0]
    return [Row(region=key.region, dept=key.dept, corr=corr)]

resRDD = df.select(["region", "dept", "week", "val1",  "val2"])\
           .map(lambda r: (Row(region= r.region, dept= r.dept), r))\
           .groupByKey()\
           .flatMap(lambda KeyValue: testFunction(KeyValue[0], list(KeyValue[1])))
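
For reference, a reduceByKey-based rewrite of the above would avoid shipping whole groups through the shuffle: each row is mapped to the sufficient statistics for Pearson's r (count, sums, sums of squares, and the sum of products of val1 and val2) keyed by (region, dept), those tuples are reduced per key, and the correlation is computed from the reduced totals. A minimal sketch, assuming the same df and that val1/val2 can be cast to float (this is not the original code):

from math import sqrt
from pyspark.sql import Row

def toStats(r):
    # One row -> ((region, dept), (count, sum_x, sum_y, sum_x2, sum_y2, sum_xy))
    x, y = float(r.val1), float(r.val2)
    return ((r.region, r.dept), (1, x, y, x * x, y * y, x * y))

def addStats(a, b):
    # Merge two partial statistics tuples element-wise
    return tuple(i + j for i, j in zip(a, b))

def toCorr(kv):
    # Pearson's r from the aggregated sufficient statistics of one group
    (region, dept), (n, sx, sy, sxx, syy, sxy) = kv
    den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return Row(region=region, dept=dept,
               corr=(n * sxy - sx * sy) / den if den else float('nan'))

resRDD = (df.select("region", "dept", "val1", "val2").rdd
            .map(toStats)
            .reduceByKey(addStats)
            .map(toCorr))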

1 Answer:

Answer 0 (score: 0):

Try:

>>> from pyspark.sql.functions import corr
>>> df.groupBy("region", "dept").agg(corr("val1",  "val2"))
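
The aggregated column comes back with a generated name (something like corr(val1, val2)); to match the desired output it can be renamed with alias, for example:

>>> from pyspark.sql.functions import corr
>>> df.groupBy("region", "dept").agg(corr("val1", "val2").alias("corr")).show()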