How do I average one field while grouping by another when working with RDDs in PySpark?

Time: 2016-02-11 09:04:26

Tags: python apache-spark pyspark rdd

I'm going around in circles between groupBy, aggregate, reduceByKey, map, and so on. My goal is to average field 16 (the last field) for each unique value of field 2.

So the output might look something like this:

NW  -8
DL  -6
OO  -1

Given an RDD with the following elements:


[u'2002-04-28,NW,19386,DTW,MI,42.21,-83.35,MSP,MN,44.88,-93.22,1220,1252,32,1316,1350,34',
 u'2012-05-04,OO,20304,LSE,WI,43.87,-91.25,MSP,MN,44.88,-93.22,1130,1126,-4,1220,1219,-1',
 u'2002-08-18,NW,19386,BDL,CT,41.93,-72.68,MSP,MN,44.88,-93.22,805,804,-1,959,952,-7',
 u'2004-07-29,NW,19386,BDL,CT,41.93,-72.68,MSP,MN,44.88,-93.22,800,757,-3,951,933,-18',
 u'2008-07-21,NW,19386,IND,IN,39.71,-86.29,MSP,MN,44.88,-93.22,1143,1140,-3,1228,1222,-6',
 u'2007-10-29,NW,19386,RST,MN,43.9,-92.5,MSP,MN,44.88,-93.22,1546,1533,-13,1639,1609,-30',
 u'2012-12-24,DL,19790,BOS,MA,42.36,-71,MSP,MN,44.88,-93.22,1427,1431,4,1648,1635,-13',
 u'2010-04-22,DL,19790,DTW,MI,42.21,-83.35,MSP,MN,44.88,-93.22,930,927,-3,1028,1008,-20',
 u'2010-06-01,DL,19790,DTW,MI,42.21,-83.35,MSP,MN,44.88,93.22,835,846,11,930,946,16',
 u'2003-09-04,NW,19386,BUF,NY,42.94,-78.73,MSP,MN,44.88,-93.22,900,852,-8,1017,955,-22']
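For reference, a minimal sketch (my own illustration, assuming the records are the comma-separated strings shown above) of where the two fields of interest land after splitting a record:

record = u'2002-04-28,NW,19386,DTW,MI,42.21,-83.35,MSP,MN,44.88,-93.22,1220,1252,32,1316,1350,34'
fields = record.split(",")
fields[1]     # field 2: the carrier code, e.g. u'NW'
fields[-1]    # the last field: the delay to average, e.g. u'34' (still a string)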

2 answers:

Answer 0 (score: 2):

Here is a solution:

data = [u'2002-04-28,NW,19386,DTW,MI,42.21,-83.35,MSP,MN,44.88,-93.22,1220,1252,32,1316,1350,34',
        u'2012-05-04,OO,20304,LSE,WI,43.87,-91.25,MSP,MN,44.88,-93.22,1130,1126,-4,1220,1219,-1',
        u'2002-08-18,NW,19386,BDL,CT,41.93,-72.68,MSP,MN,44.88,-93.22,805,804,-1,959,952,-7',
        u'2004-07-29,NW,19386,BDL,CT,41.93,-72.68,MSP,MN,44.88,-93.22,800,757,-3,951,933,-18',
        u'2008-07-21,NW,19386,IND,IN,39.71,-86.29,MSP,MN,44.88,-93.22,1143,1140,-3,1228,1222,-6',
        u'2007-10-29,NW,19386,RST,MN,43.9,-92.5,MSP,MN,44.88,-93.22,1546,1533,-13,1639,1609,-30',
        u'2012-12-24,DL,19790,BOS,MA,42.36,-71,MSP,MN,44.88,-93.22,1427,1431,4,1648,1635,-13',
        u'2010-04-22,DL,19790,DTW,MI,42.21,-83.35,MSP,MN,44.88,-93.22,930,927,-3,1028,1008,-20',
        u'2010-06-01,DL,19790,DTW,MI,42.21,-83.35,MSP,MN,44.88,93.22,835,846,11,930,946,16',
        u'2003-09-04,NW,19386,BUF,NY,42.94,-78.73,MSP,MN,44.88,-93.22,900,852,-8,1017,955,-22']
current_rdd = sc.parallelize(data)
rdd = (current_rdd
       .map(lambda x: x.split(","))                          # split each CSV line into fields
       .map(lambda x: (x[1], x[-1]))                         # (carrier, delay as a string)
       .groupByKey()                                         # group by key
       .map(lambda x: (x[0], map(int, list(x[1]))))          # convert ResultIterable to a list of ints
       .map(lambda x: (x[0], float(sum(x[1])) / len(x[1])))) # compute the average of the list for each key
# output
rdd.take(10)
# [(u'DL', -5.666666666666667), (u'NW', -8.166666666666666), (u'OO', -1.0)]
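Since the question also mentions aggregate and reduceByKey: as a sketch (not part of the original answer), the same average can be computed in a single pass with aggregateByKey, which carries only a running (sum, count) per key instead of materializing the full list of values the way groupByKey does:

pairs = current_rdd.map(lambda x: x.split(",")).map(lambda x: (x[1], int(x[-1])))
sum_count = pairs.aggregateByKey(
    (0, 0),                                     # zero value: (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),    # fold one value into a partition-local accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))    # merge accumulators across partitions
averages = sum_count.mapValues(lambda t: float(t[0]) / t[1])
averages.collect()
# [(u'DL', -5.666666666666667), (u'NW', -8.166666666666666), (u'OO', -1.0)]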

Answer 1 (score: 2):

OK, this is a shot in the dark since I don't have an environment to try it in (which is unfortunate).

I'm assuming you already have an RDD whose records are split into fields:

mappedData  = data.map(lambda d: (d[1], int(d[-1]))).cache()  # (NW, 34), (OO, -1), (NW, -7), ...
groupedData = mappedData.groupByKey().mapValues(len)          # (NW, [34, -7]) -> (NW, 2)
sumData     = mappedData.groupByKey().mapValues(sum)          # (NW, [34, -7]) -> (NW, 27)
sumData.join(groupedData).map(lambda kv: (kv[0], float(kv[1][0]) / kv[1][1]))  # (NW, (27, 2)) -> (NW, 27/2)
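The cache() on mappedData keeps the split pairs in memory so the two groupByKey passes don't recompute them from the source. To actually inspect the averages, collect the final RDD (a usage sketch following the names above):

result = sumData.join(groupedData).map(lambda kv: (kv[0], float(kv[1][0]) / kv[1][1]))
for carrier, avg in result.collect():
    print(carrier, avg)  # e.g. NW -8.1666..., DL -5.6666..., OO -1.0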