I have data in this format:
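(123456,(43,4861))
(000456,(43,4861))
The first element is the point id. The second element is a pair whose first id is one cluster centroid and whose second id is another cluster centroid. In other words, point 123456 has been assigned to clusters 43 and 4861.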
What I want to do is produce data in this format:
(43,[123456,000456])
(4861,[123456,000456])
The idea is that every centroid gets a list of the points assigned to it. Each list must hold at most 150 points.
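What can I use in Spark or Python to make my life easier?
I don't care about fast access or about ordering. I have 100 million points and 16k centroids.
Here is some artificial data I have been playing with: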
from random import randint

data = []
for i in range(0, 10):
    # (point id, (first centroid id, second centroid id))
    data.append((randint(0, 100000000), (randint(0, 16000), randint(0, 16000))))
data = sc.parallelize(data)  # sc is an already-created SparkContext
Answer 0 (score: 1)
From what you have described (although I still don't fully understand it), here is a naive approach in plain Python:
In [1]: from itertools import groupby
In [2]: from random import randint
In [3]: data = []  # create random samples as you did
   ...: for i in range(10):
   ...:     data.append((randint(0, 100000000), (randint(0, 16000), randint(0, 16000))))
   ...:
In [4]: result = []  # create an intermediate list to transform your sample
   ...: for point_id, cluster in data:
   ...:     for index, c in enumerate(cluster):
   ...:         # I made it up following your pattern
   ...:         result.append((c, [point_id, str(index * 100).zfill(3) + str(point_id)[-3:]]))
   ...: # sort the result by point_id as key for grouping
   ...: result = sorted(result, key=lambda x: x[1][0])
   ...:
In [5]: result[:3]
Out[5]:
[(4020, [5002188, '000188']),
(10983, [5002188, '100188']),
(10800, [24763401, '000401'])]
In [6]: capped_result = []
   ...: # basically groupby sorted point_id and cap the list max at 150
   ...: for _, g in groupby(result, key=lambda x: x[1][0]):
   ...:     grouped = list(g)[:150]
   ...:     capped_result.extend(grouped)
   ...: # final result will be like
   ...: print(capped_result)
   ...:
[(4020, [5002188, '000188']), (10983, [5002188, '100188']), (10800, [24763401, '000401']), (12965, [24763401, '100401']), (6369, [24924435, '000435']), (429, [24924435, '100435']), (7666, [39240078, '000078']), (2526, [39240078, '100078']), (5260, [47597265, '000265']), (7056, [47597265, '100265']), (2824, [60159219, '000219']), (5730, [60159219, '100219']), (7837, [67208338, '000338']), (12475, [67208338, '100338']), (4897, [80084812, '000812']), (13038, [80084812, '100812']), (2944, [80253323, '000323']), (1922, [80253323, '100323']), (12777, [96811112, '000112']), (5463, [96811112, '100112'])]
Of course, this is not optimized at all, but it should give you a head start on how to approach the problem. I hope it helps.
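If the goal is literally "one list of at most 150 point ids per centroid", a rough sketch of that grouping might look like the following. All function names here (`add_point`, `merge_lists`, `to_centroid_pairs`, `group_locally`, `group_with_spark`) are made up for illustration, and the Spark variant assumes `rdd` is an RDD of `(point_id, (c1, c2))` tuples like the sample data in the question. It relies on `flatMap` and `aggregateByKey`, so each list is capped while it is being built instead of after a full `groupByKey`.

```python
from collections import defaultdict

MAX_POINTS = 150  # cap taken from the question


def add_point(points, point_id, cap=MAX_POINTS):
    """Append point_id to points unless the list already holds cap items."""
    if len(points) < cap:
        points.append(point_id)
    return points


def merge_lists(left, right, cap=MAX_POINTS):
    """Combine two partial lists for the same centroid, keeping at most cap items."""
    return (left + right)[:cap]


def to_centroid_pairs(record):
    """Turn (point_id, (c1, c2)) into [(c1, point_id), (c2, point_id)]."""
    point_id, centroids = record
    return [(centroid, point_id) for centroid in centroids]


def group_locally(records, cap=MAX_POINTS):
    """Plain-Python version, only meant for small samples."""
    grouped = defaultdict(list)
    for record in records:
        for centroid, point_id in to_centroid_pairs(record):
            add_point(grouped[centroid], point_id, cap)
    return dict(grouped)


def group_with_spark(rdd, cap=MAX_POINTS):
    """Same idea on an RDD (assumption: records look like (point_id, (c1, c2)))."""
    return (rdd.flatMap(to_centroid_pairs)
               .aggregateByKey([],
                               lambda acc, p: add_point(acc, p, cap),
                               lambda a, b: merge_lists(a, b, cap)))
```

With the sample from the question, `group_with_spark(data).take(3)` should give back a few `(centroid, [point ids])` pairs. Which 150 points survive for a crowded centroid is arbitrary, which seems acceptable since ordering does not matter here. With 100 million points and 16k centroids, each centroid would otherwise collect roughly 12,500 ids on average, so capping inside the aggregation keeps the shuffled data much smaller than grouping first and truncating afterwards.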