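I'm new to PySpark. I have a pair RDD of (key, value) records, and I want to build a histogram with n buckets for each key. The output would look like this: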
[(key1, [...buckets...], [...counts...]),
(key2, [...buckets...], [...counts...])]
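I've seen examples that retrieve the maximum or the sum of the values for each key, but is there a way to pass a histogram(n) function to be applied to each key's values?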
Answer 0 (score: 0)
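I know this post is fairly old, but for anyone still looking for a PySpark solution, here are my two cents.
Consider a pair RDD of (key, value) records, and let "histogram" mean a simple counter that records, for each key, which distinct values occur and how many times each one occurs.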
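aggregateByKey() is a good way to do this. It takes three arguments: the zero value of the accumulator, the within-partition aggregation function, and the between-partition aggregation function.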
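To make the roles of the three arguments concrete, here is a minimal pure-Python sketch that mimics what aggregateByKey() does, with no Spark required. The helper name local_aggregate_by_key and the hand-made partitions list are hypothetical and exist only for illustration; real Spark decides the partitioning and the merge order itself.

```python
import copy


def local_aggregate_by_key(partitions, zero_value, seq_func, comb_func):
    """Mimic aggregateByKey() on a list of partitions (lists of (key, value))."""
    per_partition = []
    for partition in partitions:
        accumulators = {}
        for key, value in partition:
            # Each key starts from its own copy of the zero value.
            if key not in accumulators:
                accumulators[key] = copy.deepcopy(zero_value)
            # Within-partition step: fold one value into the accumulator.
            accumulators[key] = seq_func(accumulators[key], value)
        per_partition.append(accumulators)

    merged = {}
    for accumulators in per_partition:
        for key, acc in accumulators.items():
            # Between-partition step: merge two accumulators of the same key.
            merged[key] = comb_func(merged[key], acc) if key in merged else acc
    return merged


# Two hypothetical partitions holding records for keys 141 and 145.
partitions = [[(141, 2), (141, 1), (145, 1)], [(141, 1), (145, 1), (145, 1)]]

# Sum the values of each key: zero value 0, add within and between partitions.
sums = local_aggregate_by_key(
    partitions, 0, lambda acc, v: acc + v, lambda a, b: a + b
)
print(sums)  # {141: 4, 145: 3}
```

The same three pieces (zero value, within-partition function, between-partition function) are what the per-key counter in this answer plugs in, with a dictionary in place of the running sum.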
Consider an RDD of the form
[(124, 2),
(124, 2),
(124, 2),
(125, 2),
(125, 2),
(125, 2),
(126, 2),
(126, 2),
(126, 2),
(127, 2),
(127, 2),
(127, 2),
(128, 2),
(128, 2),
(128, 2),
(129, 2),
(129, 2),
(129, 2),
(130, 2),
(130, 2),
(130, 2),
(131, 2),
(131, 2),
(131, 2),
(132, 2),
(132, 2),
(132, 2),
(133, 2),
(133, 2),
(133, 2),
(134, 2),
(134, 2),
(134, 2),
(135, 2),
(135, 2),
(135, 2),
(136, 2),
(136, 1),
(136, 2),
(137, 2),
(137, 2),
(137, 2),
(138, 2),
(138, 2),
(138, 2),
(139, 2),
(139, 2),
(139, 2),
(140, 2),
(140, 2),
(140, 2),
(141, 2),
(141, 1),
(141, 1),
(142, 2),
(142, 2),
(142, 2),
(143, 2),
(143, 2),
(143, 2),
(144, 1),
(144, 1),
(144, 2),
(145, 1),
(145, 1),
(145, 1),
(146, 2),
(146, 2),
(146, 2),
(147, 2),
(147, 2),
(147, 2),
(148, 2),
(148, 2),
(148, 2),
(149, 2),
(149, 2),
(149, 2),
(150, 2),
(150, 2),
(150, 2),
(151, 2),
(151, 2),
(151, 2),
(152, 2),
(152, 2),
(152, 2),
(153, 2),
(153, 1),
(153, 2),
(154, 2),
(154, 2),
(154, 2),
(155, 2),
(155, 1),
(155, 2),
(156, 2),
(156, 2),
(156, 2),
(157, 1),
(157, 2),
(157, 2),
(158, 2),
(158, 2),
(158, 2),
(159, 2),
(159, 2),
(159, 2),
(160, 2),
(160, 2),
(160, 2),
(161, 2),
(161, 1),
(161, 2),
(162, 2),
(162, 2),
(162, 2),
(163, 2),
(163, 1),
(163, 2),
(164, 2),
(164, 2),
(164, 2),
(165, 2),
(165, 2),
(165, 2),
(166, 2),
(166, 1),
(166, 2),
(167, 2),
(167, 2),
(167, 2),
(168, 2),
(168, 1),
(168, 1),
(169, 2),
(169, 2),
(169, 2),
(170, 2),
(170, 2),
(170, 2),
(171, 2),
(171, 2),
(171, 2),
(172, 2),
(172, 2),
(172, 2),
(173, 2),
(173, 2),
(173, 1),
(174, 2),
(174, 1),
(174, 1),
(175, 1),
(175, 1),
(175, 1),
(176, 1),
(176, 1),
(176, 1),
(177, 2),
(177, 2),
(177, 2)]
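As far as I know, the simplest approach is to aggregate the values of each key into a Python dictionary, where each dictionary key is an RDD value and the associated dictionary value counts how many times that RDD value occurs. There is no need to handle the RDD key explicitly, because aggregateByKey() takes care of it automatically.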
The aggregation call has the form
myRDD.aggregateByKey(dict(), withinPartition, betweenPartition)
Every accumulator is initialized to an empty dictionary.
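The within-partition aggregation function therefore has the following form
def withinPartition(dictionary, record):
    # record is one RDD value; dictionary maps RDD value -> count so far
    if record in dictionary:
        dictionary[record] += 1
    else:
        dictionary[record] = 1
    return dictionary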
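where dictionary is the per-RDD-value counter and record is a given RDD value (an integer in this case; see the sample RDD above). If the RDD value is already in the dictionary, we increment its counter by 1. Otherwise, we initialize the counter to 1.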
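The between-partition function works in much the same way
def betweenPartition(dictionary1, dictionary2):
    # sum the counts over the union of the keys of both dictionaries
    return {k: dictionary1.get(k, 0) + dictionary2.get(k, 0) for k in set(dictionary1) | set(dictionary2)}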
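For a given RDD key, consider two dictionaries. We merge them into a single dictionary by summing the values of each dictionary key, treating a key that is missing from one of the two dictionaries as 0 (the merged keys are the union of both key sets). Credit for the dictionary merge goes to georg's solution in this post.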
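As a quick sanity check that needs no cluster, the two functions can be exercised on plain Python data. The split of key 141's three values into two partitions below is made up for illustration (Spark chooses the actual partitioning), and withinPartition is written here with dict.get as an equivalent shorthand.

```python
def withinPartition(dictionary, record):
    # Count one more occurrence of this RDD value.
    dictionary[record] = dictionary.get(record, 0) + 1
    return dictionary


def betweenPartition(dictionary1, dictionary2):
    # Sum the counters over the union of the keys of both dictionaries.
    return {k: dictionary1.get(k, 0) + dictionary2.get(k, 0)
            for k in set(dictionary1) | set(dictionary2)}


# Key 141 has the values 2, 1, 1 in the sample RDD.
# Assume (hypothetically) that they land in two different partitions.
partition_a = [2, 1]
partition_b = [1]

acc_a = {}
for value in partition_a:
    acc_a = withinPartition(acc_a, value)

acc_b = {}
for value in partition_b:
    acc_b = withinPartition(acc_b, value)

result = betweenPartition(acc_a, acc_b)
print(result == {1: 2, 2: 1})  # True
```

The merged dictionary is the same whichever way the three values are split, which is the property the between-partition function has to provide.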
The resulting RDD has the form
[(162, {2: 3}),
(132, {2: 3}),
(168, {1: 2, 2: 1}),
(138, {2: 3}),
(174, {1: 2, 2: 1}),
(144, {1: 2, 2: 1}),
(150, {2: 3}),
(156, {2: 3}),
(126, {2: 3}),
(163, {1: 1, 2: 2}),
(133, {2: 3}),
(169, {2: 3}),
(139, {2: 3}),
(175, {1: 3}),
(145, {1: 3}),
(151, {2: 3}),
(157, {1: 1, 2: 2}),
(127, {2: 3}),
(128, {2: 3}),
(164, {2: 3}),
(134, {2: 3}),
(170, {2: 3}),
(140, {2: 3}),
(176, {1: 3}),
(146, {2: 3}),
(152, {2: 3}),
(158, {2: 3}),
(129, {2: 3}),
(165, {2: 3}),
(135, {2: 3}),
(171, {2: 3}),
(141, {1: 2, 2: 1}),
(177, {2: 3}),
(147, {2: 3}),
(153, {1: 1, 2: 2}),
(159, {2: 3}),
(160, {2: 3}),
(130, {2: 3}),
(166, {1: 1, 2: 2}),
(136, {1: 1, 2: 2}),
(172, {2: 3}),
(142, {2: 3}),
(148, {2: 3}),
(154, {2: 3}),
(124, {2: 3}),
(161, {1: 1, 2: 2}),
(131, {2: 3}),
(167, {2: 3}),
(137, {2: 3}),
(173, {1: 1, 2: 2}),
(143, {2: 3}),
(149, {2: 3}),
(155, {1: 1, 2: 2}),
(125, {2: 3})]
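The original RDD keys are still present in the new RDD. Each new RDD value is a dictionary: each dictionary key is one of the RDD values seen for that RDD key, and each dictionary value counts how many times that RDD value occurs for the RDD key.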
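The question asked for output shaped like (key, [...buckets...], [...counts...]). One possible way to get there from the dictionaries is sketched below on two entries copied from the result; the helper name to_buckets_and_counts is hypothetical, and treating each distinct value as its own bucket is an assumption that follows this answer's definition of "histogram".

```python
def to_buckets_and_counts(key, counter):
    # Sort the distinct values so buckets and counts line up by position.
    buckets = sorted(counter)
    counts = [counter[b] for b in buckets]
    return (key, buckets, counts)


# Two entries taken from the resulting RDD shown above.
sample = [(168, {1: 2, 2: 1}), (162, {2: 3})]

formatted = [to_buckets_and_counts(k, d) for k, d in sample]
print(formatted)  # [(168, [1, 2], [2, 1]), (162, [2], [3])]
```

On the aggregated RDD itself, the same helper could be applied with something like .map(lambda kv: to_buckets_and_counts(kv[0], kv[1])).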
Answer 1 (score: -1)
Try:
>>> import numpy as np
>>>
>>> # lambda (x, y): ... is Python 2 only, so index into the (key, values) pair instead
>>> rdd.groupByKey().map(lambda kv: np.histogram(list(kv[1])))
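For the n-bucket case in the question, a per-key histogram function could look roughly like the pure-Python sketch below. The name histogram_n and the choice of equal-width buckets with a right-closed last bucket are assumptions made for illustration; np.histogram with its bins argument is an alternative if NumPy is available on the workers.

```python
def histogram_n(values, n):
    """Return (bucket_edges, counts) for n equal-width buckets over values."""
    values = list(values)
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate range: a single bucket holds everything.
        return [lo, hi], [len(values)]
    width = (hi - lo) / n
    edges = [lo + i * width for i in range(n)] + [hi]
    counts = [0] * n
    for v in values:
        # The last bucket is closed on the right, so hi falls into bucket n - 1.
        index = min(int((v - lo) / width), n - 1)
        counts[index] += 1
    return edges, counts


edges, counts = histogram_n([1, 2, 2, 3, 4, 8], 2)
print(edges, counts)  # [1.0, 4.5, 8] [5, 1]
```

Applied per key, this might be used as rdd.groupByKey().mapValues(lambda vals: histogram_n(vals, n)), which keeps each key next to its histogram. Note that groupByKey() collects all values of a key on one worker, so this suits keys with a moderate number of values.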