I have a question about aggregateByKey in pyspark.
I have an RDD dataset that looks like this: premierRDD = [('Chelsea', ('2016–2017', 93)), ('Chelsea', ('2015–2016', 50))]
I want to add up the scores 93 and 50 with the aggregateByKey function; my expected output is: [('Chelsea', '2016–2017', (93, 143)), ('Chelsea', '2015–2016', (50, 143))]
However, I get the following output instead: [('Chelsea', ('', 143))]
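A minimal sketch of a zero value, seqFunc and combFunc that reproduce this output (an assumption for illustration, not the question's original code; it presumes premierRDD was built with sc.parallelize from the list above):
# Sketch only: this drops the year, so each key collapses into a single ('', total) tuple.
seqFunc = (lambda x, y: ('', x[1] + y[1]))        # keep only the running sum of scores
combFunc = (lambda x, y: (x[0], x[1] + y[1]))     # merge partial sums, year stays ''
premierAgg = premierRDD.aggregateByKey(('', 0), seqFunc, combFunc)
print(premierAgg.collect())                       # [('Chelsea', ('', 143))]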
Can anyone suggest how to use the aggregateByKey function correctly?
Answer 0 (score: 0)
I adjusted your code to achieve the desired result. First, you need to keep the "year" value in the seqFunc, so I added y[0] there. Then the combine function had to be changed so that it contains not only the sum but also the original value in a tuple; the year value is kept as well. As explained in the comments, this yields [('Chelsea', [(u'2016-2017', (93, 143)), (u'2015-2016', (50, 143))])], because identical keys are merged. To get the output with Chelsea appearing twice, you can use the additional map function shown in the code below.
# sc is an existing SparkContext (e.g. the one provided by the pyspark shell)
rdd = sc.parallelize([('Chelsea', (u"2016-2017", 93)), ('Chelsea', (u"2015-2016", 50))])
# keep the year (y[0]) and add the score to the running sum
seqFunc = (lambda x, y: (y[0], x[0] + y[1]))
# combine the two partition results into [(year, (score, total)), ...]
combFunc = (lambda x, y: [(x[0], (x[1], x[1] + y[1])), (y[0], (y[1], x[1] + y[1]))])
premierAgg = rdd.aggregateByKey((0, 0), seqFunc, combFunc)
# flatten the per-key list so that 'Chelsea' appears once per season
print(premierAgg.map(lambda r: [(r[0], a) for a in r[1]]).collect()[0])
Output:
[('Chelsea', (u'2016-2017', (93, 143))), ('Chelsea', (u'2015-2016', (50, 143)))]
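Note that the combFunc above only runs because the two records land in different partitions; if both ended up in the same partition, the seqFunc would try to add a score to a year string. A partition-independent sketch that reaches the same result (an alternative assumed here, not part of the answer above) is to sum the scores per club first and then join the total back onto every (year, score) pair:
# Alternative sketch (assumption): partition-independent, using the rdd defined above.
totals = rdd.mapValues(lambda v: v[1]).reduceByKey(lambda a, b: a + b)   # ('Chelsea', 143)
withTotal = rdd.join(totals).map(lambda kv: (kv[0], (kv[1][0][0], (kv[1][0][1], kv[1][1]))))
print(withTotal.collect())   # gives ('Chelsea', ('2016-2017', (93, 143))) and ('Chelsea', ('2015-2016', (50, 143))); order may vary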