Splitting a list of lengths into balanced chunks

Time: 2018-01-11 15:57:48

Tags: python optimization counter

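I have to write a script in Python. I have a long list of integers; they are all lengths on a particular scale, and of course there are repeats. I need to find the best "intervals" so that the resulting chunks are balanced. An example: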

[1,2,2,5,2,3,5,4,5]

Using Counter and sorting the result, I get:

[(1, 1), (2, 3), (3, 1), (4, 1), (5, 3)]

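If I need two buckets, I count the elements (9 in this case) and divide that by the number of buckets, which means each bucket should hold about 4 elements. In my code I walk the list of tuples, summing the counts until the running total is at least 4: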

(1,1) >= 4? False
(1,1) + (2,3) = 4 >=4? True, break;

So the first interval is 1-2. Then:

(3,1) >=4? False
(3,1)+(4,1) >=4? False
(3,1)+(4,1)+(5,3) >=4? True

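So the second interval is 3-5.

My dataset has hundreds of thousands of elements, so this task (counting, sorting, parsing) is very time-consuming. Is there a way to speed it up?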
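
For reference, the procedure described above can be sketched as follows. This is only an illustration of the steps as described, not the actual code in question; the function name `greedy_buckets` and its parameters are made up for the example, and the input is rebuilt from the Counter result shown earlier.

```python
from collections import Counter

def greedy_buckets(lengths, n_buckets):
    """Hypothetical helper: greedy split into consecutive intervals.

    An interval is closed as soon as its running count reaches
    len(lengths) // n_buckets.
    """
    target = len(lengths) // n_buckets
    intervals = []
    start, running = None, 0
    for value, count in sorted(Counter(lengths).items()):
        if start is None:
            start = value          # first value of a new interval
        running += count
        if running >= target:
            intervals.append((start, value))
            start, running = None, 0
    if start is not None:
        # leftover values that never reached the target form a last interval
        intervals.append((start, value))
    return intervals

# Rebuild the example from its counts: 1 once, 2 three times, 3 once, ...
lengths = [1] + [2] * 3 + [3] + [4] + [5] * 3
print(greedy_buckets(lengths, 2))  # [(1, 2), (3, 5)]
```

Note that a greedy pass like this can return more or fewer than `n_buckets` intervals when the counts are skewed.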

1 answer:

Answer 0: (score: 1)

Here is one way to create contiguous buckets of roughly equal size. It uses collections.Counter, heapq.merge, itertools.accumulate and itertools.groupby, all from the standard library:

from itertools import groupby, accumulate
from heapq import merge
from collections import Counter
from math import sin, pi
import random

# make test data a bit uneven
def mock_data(N):
    return [int(sin(2*pi*random.random())*50 + 50) for _ in range(N)]

N = 1000000

data = mock_data(N)

counts = Counter(data)
srtcnts = sorted(counts.items())

k = 7 # number of buckets

slabels, scounts = zip(*srtcnts)
# compute cumulative bin centers
bincntrs = (a - c/2 for a, c in zip(accumulate(scounts), scounts))
# mix in the optimal boundaries
split = merge(zip(bincntrs, slabels), zip(range(0, N, -(-N//k))))
# group into boundaries and stuff between boundaries;
# keep only the stuff between
# (boundaries are 1-tuples, data points are 2-tuples, so len tells them apart)
res = [[v[1] for v in grp] for size, grp in groupby(split, len) if size == 2]

print(res)
# show they are balanced
print([sum(counts[i] for i in chunk) for chunk in res])

Sample output:

[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38], [39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60], [61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80], [81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94], [95, 96, 97, 98, 99]]
[143297, 143387, 142010, 141358, 143224, 143617, 143107]
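
To see how the merge trick behaves on the small example from the question, the same steps can be wrapped in a function. This is just a sketch: the name `balanced_buckets` and its signature are made up here, and it assumes non-empty data and k >= 1. The input is rebuilt from the Counter result in the question.

```python
from collections import Counter
from heapq import merge
from itertools import accumulate, groupby

def balanced_buckets(data, k):
    """Hypothetical wrapper around the snippet above.

    Splits the distinct values of `data` into at most k consecutive groups.
    """
    n = len(data)
    counts = Counter(data)
    labels, sizes = zip(*sorted(counts.items()))
    # midpoint of each value's span on the cumulative-count axis
    centers = (a - c / 2 for a, c in zip(accumulate(sizes), sizes))
    # ideal cut positions 0, step, 2*step, ... with step = ceil(n / k)
    cuts = zip(range(0, n, -(-n // k)))
    # 1-tuples (cuts) and 2-tuples (center, label) interleave in sorted order
    mixed = merge(zip(centers, labels), cuts)
    # each run of 2-tuples between consecutive cuts is one bucket
    return [[label for _, label in grp]
            for size, grp in groupby(mixed, len) if size == 2]

data = [1] + [2] * 3 + [3] + [4] + [5] * 3
buckets = balanced_buckets(data, 2)
counts = Counter(data)
print(buckets)                                        # [[1, 2, 3], [4, 5]]
print([sum(counts[v] for v in b) for b in buckets])   # [5, 4]
```

Here the cut lands at cumulative position 5, so value 3 (center 4.5) falls into the first bucket. The greedy pass in the question gives 1-2 and 3-5 instead; both splits are 4/5 or 5/4, which is as balanced as nine elements in two buckets can get.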