AggregateByKey function with key-value pairs in PySpark

Date: 2019-02-22 06:04:14

Tags: pyspark aggregate-functions

I have a question about aggregateByKey in PySpark.

I have an RDD dataset, as follows: premierRDD = [('Chelsea', ('2016-2017', 93)), ('Chelsea', ('2015-2016', 50))]

I want to use the aggregateByKey function to sum the scores 93 and 50; my expected output should be: [('Chelsea', ('2016-2017', (93, 143))), ('Chelsea', ('2015-2016', (50, 143)))]
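A minimal sketch of the kind of attempt that leads to the problem below (a hypothetical reconstruction inferred from the answer's description, not necessarily the poster's exact code):

# hypothetical reconstruction -- the zero value and both lambdas are assumptions
seqFunc = (lambda x, y: (x[0], x[1] + y[1]))   # keeps the accumulator's year, which starts as ''
combFunc = (lambda x, y: (x[0], x[1] + y[1]))  # merges partial sums, still carrying the '' year
premierAgg = premierRDD.aggregateByKey(('', 0), seqFunc, combFunc)
print(premierAgg.collect())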


However, I get the following output: [('Chelsea', ('', 143))]

Can someone suggest how to use the aggregateByKey function correctly?

1 answer:

Answer 0 (score: 0)

I adjusted your code to achieve the desired result. First, you need to keep the "year" value in the seqFunc, which is why I added y[0] there. The combine function then has to be changed so that it contains not only the sum but also the original score in a tuple; the year value is likewise preserved. As explained in the comments, this yields [('Chelsea', [('2016-2017', (93, 143)), ('2015-2016', (50, 143))])], because identical keys are merged. To get 'Chelsea' twice in the output, you can apply an additional map function, as shown in the code below.

# assumes an existing SparkContext `sc`, e.g. from the pyspark shell
rdd = sc.parallelize([('Chelsea', ("2016-2017", 93)), ('Chelsea', ("2015-2016", 50))])
# seqFunc keeps the year of the incoming value and adds its score to the
# running sum (the accumulator starts as (0, 0); the sum lives at index 1)
seqFunc = (lambda x, y: (y[0], x[1] + y[1]))
# combFunc merges the two per-partition accumulators, pairing each original
# (year, score) with the overall total
combFunc = (lambda x, y: [(x[0], (x[1], x[1] + y[1])), (y[0], (y[1], x[1] + y[1]))])

premierAgg = rdd.aggregateByKey((0, 0), seqFunc, combFunc)
# flatten the per-key list so 'Chelsea' appears once per season
print(premierAgg.map(lambda r: [(r[0], a) for a in r[1]]).collect()[0])

Output:

[('Chelsea', ('2016-2017', (93, 143))), ('Chelsea', ('2015-2016', (50, 143)))]
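A note beyond the original answer: the combFunc trick above only produces the two-season list when the two records land in different partitions, since combFunc never runs inside a single partition. A partition-independent sketch (assuming the same rdd as above) computes the per-key total with reduceByKey and joins it back onto the original records:

# compute the total per key, then attach it to every (year, score) record
totals = rdd.mapValues(lambda v: v[1]).reduceByKey(lambda a, b: a + b)  # [('Chelsea', 143)]
withTotal = rdd.join(totals).map(lambda kv: (kv[0], (kv[1][0][0], (kv[1][0][1], kv[1][1]))))
print(withTotal.collect())
# [('Chelsea', ('2016-2017', (93, 143))), ('Chelsea', ('2015-2016', (50, 143)))]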