Question

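PySpark: I want to pass my custom dictionary, which holds the positions of a few locations, to every task in PySpark. For each row in my RDD, I need to compute the distance between that row's location and every location in the dictionary, then take the minimum distance. Broadcasting did not solve my problem.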

Example: dict = {(a, 3), (b, 6), (c, 2)} RDD: (location1, 5) (location2, 9) (location3, 8)

Output: (location1, 1) (location2, 3) (location3, 2)

Please help, and thanks.

Answer 1

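A broadcast variable would definitely solve your problem in this case, but you can also pass the dictionary (or a list, see below) in through the map function. Whether a broadcast variable is worth using depends on the size of the object.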

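First, since all you want is the minimum distance, it looks like you don't care about the dictionary's keys, only its values. If you sort those values into a list, you can find the minimum distance efficiently.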

>>> d = {'a': 3, 'b': 6, 'c': 2}
>>> locations = sorted(d.values())  # d.itervalues() exists only in Python 2
>>> rdd = sc.parallelize([('location1', 5), ('location2', 9), ('location3', 8)])
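For reference, the quantity being computed for each row can be written as a brute-force scan over the dictionary values. This is only a plain-Python sketch of the target result, with no Spark involved; the helper name `brute_force_min_distance` and the `rows` list standing in for the RDD are illustrative assumptions, not part of the original answer.

```python
def brute_force_min_distance(loc, values):
    """Return the smallest absolute difference between loc and any value."""
    return min(abs(loc - v) for v in values)


d = {'a': 3, 'b': 6, 'c': 2}
# Plain list standing in for the RDD rows (assumption for illustration).
rows = [('location1', 5), ('location2', 9), ('location3', 8)]

result = [(name, brute_force_min_distance(loc, d.values())) for name, loc in rows]
print(result)  # [('location1', 1), ('location2', 3), ('location3', 2)]
```

This scan costs one comparison per dictionary entry for every row, which is why sorting the values once and searching them with bisection pays off when the dictionary is large.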

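Now define a function that uses bisect.bisect to find the minimum distance. We use functools.partial to fix the second argument, which turns the general two-argument function into a function of a single element.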

>>> from functools import partial
>>> from bisect import bisect
>>> def find_min_distance(loc, locations):
...     ind = bisect(locations, loc)
...     if ind == len(locations):
...         return loc - locations[-1]
...     elif ind == 0:
...         return locations[0] - loc
...     else:
...         left_dist = loc - locations[ind - 1]
...         right_dist = locations[ind] - loc
...         return min(left_dist, right_dist)
>>> mapper = partial(find_min_distance, locations=locations)
>>> rdd.mapValues(mapper).collect()
[('location1', 1), ('location2', 3), ('location3', 2)]
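The bisection logic can be checked locally, without a Spark context, by comparing it against a brute-force scan over a range of inputs. The sketch below repeats `find_min_distance` from the answer so that it runs on its own; the `brute_force` helper and the test range are assumptions added for illustration.

```python
from bisect import bisect


def find_min_distance(loc, locations):
    """Minimum distance from loc to any entry of the sorted list locations."""
    ind = bisect(locations, loc)  # insertion point to the right of equal values
    if ind == len(locations):
        # loc is at or beyond the largest entry
        return loc - locations[-1]
    elif ind == 0:
        # loc is below the smallest entry
        return locations[0] - loc
    else:
        # loc lies between two neighbours; the closer one wins
        left_dist = loc - locations[ind - 1]
        right_dist = locations[ind] - loc
        return min(left_dist, right_dist)


def brute_force(loc, locations):
    """Reference implementation: scan every entry (hypothetical helper)."""
    return min(abs(loc - x) for x in locations)


locations = sorted([3, 6, 2])
# Collect any inputs where the two implementations disagree.
mismatches = [loc for loc in range(-5, 15)
              if find_min_distance(loc, locations) != brute_force(loc, locations)]
print(mismatches)  # [] when both implementations agree
```

Because the function depends only on the sorted list and not on Spark, the same definition can be tested this way before it is handed to `mapValues`.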

To use a broadcast variable instead:

>>> locations_bv = sc.broadcast(locations)
>>> def mapper(loc):
...     return find_min_distance(loc, locations_bv.value)
...
>>> rdd.mapValues(mapper).collect()
[('location1', 1), ('location2', 3), ('location3', 2)]
