How can I convert groupByKey to reduceByKey in PySpark? I have attached a snippet below. It applies corr to each region/dept combination. I have used groupByKey, but it is very slow and fails with shuffle errors (I have 10-20 GB of data and each group holds 2-3 GB). Please help me rewrite this with reduceByKey.

Dataset
region  dept  week  val1  val2
US      CS    1     1     2
US      CS    2     1.5   2
US      CS    3     1     2
US      ELE   1     1.1   2
US      ELE   2     2.1   2
US      ELE   3     1     2
UE      CS    1     2     2
Output

region  dept  corr
US      CS    0.5
US      ELE   0.6
UE      CS    0.3333
Code
import pandas as pd
from scipy.stats import pearsonr
from pyspark.sql import Row

def testFunction(key, values):
    # Collect the rows of this group into a pandas DataFrame
    inputpdDF = []
    for val in values:
        keysValue = val.asDict().keys()
        inputpdDF.append(dict([(keyRDD, val[keyRDD]) for keyRDD in keysValue]))
    pdDF = pd.DataFrame(inputpdDF)
    # Pearson correlation between val1 and val2 for this group
    corr = pearsonr(pdDF['val1'].astype(float), pdDF['val2'].astype(float))[0]
    corrDict = {"region": key.region, "dept": key.dept, "corr": corr}
    return [Row(**corrDict)]
resRDD = df.select(["region", "dept", "week", "val1", "val2"])\
.map(lambda r: (Row(region= r.region, dept= r.dept), r))\
.groupByKey()\
.flatMap(lambda KeyValue: testFunction(KeyValue[0], list(KeyValue[1])))
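For the reduceByKey route the question asks about, one option is to reduce each group down to the sufficient statistics of the Pearson correlation (count, sums, sums of squares, and the cross sum) instead of materializing every row per key. Below is a minimal sketch under that assumption, reusing the same df and column names; the helper names row_stats, combine_stats, and corr_from_stats are made up here for illustration:

from pyspark.sql import Row
import math

# Per-row sufficient statistics: (n, sum_x, sum_y, sum_x2, sum_y2, sum_xy)
def row_stats(r):
    x, y = float(r.val1), float(r.val2)
    return ((r.region, r.dept), (1, x, y, x * x, y * y, x * y))

# Commutative/associative merge of two statistics tuples (what reduceByKey requires)
def combine_stats(a, b):
    return tuple(ai + bi for ai, bi in zip(a, b))

# Pearson correlation recovered from the aggregated statistics
def corr_from_stats(kv):
    (region, dept), (n, sx, sy, sxx, syy, sxy) = kv
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    corr = num / den if den != 0 else float('nan')
    return Row(region=region, dept=dept, corr=corr)

resRDD = (df.select("region", "dept", "val1", "val2")
            .rdd
            .map(row_stats)
            .reduceByKey(combine_stats)   # only small 6-element tuples are shuffled
            .map(corr_from_stats))

Because only a small tuple per key travels through the shuffle, this avoids collecting 2-3 GB groups onto a single task, which is what makes the groupByKey version slow and shuffle-heavy.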
Answer 0 (score: 0)
Try:
>>> from pyspark.sql.functions import corr
>>> df.groupBy("region", "dept").agg(corr("val1", "val2"))
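For completeness, a slightly expanded version of that DataFrame approach (the alias and show() calls are additions for readability, not part of the original answer):

from pyspark.sql import functions as F

# One row per region/dept with the Pearson correlation of val1 and val2
result = (df.groupBy("region", "dept")
            .agg(F.corr("val1", "val2").alias("corr")))
result.show()

This pushes the aggregation into Spark SQL, so no per-group data has to be collected in Python at all.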