Question

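I am looking for an optimized tool in Python to perform an array manipulation task that I find myself doing over and over. If such a tool already exists, e.g. in numpy or pandas, I would rather use it than keep using my own cythonized for loop.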

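I have two arrays of the same length, A and B, storing some information about grouped data. The i-th entry of array A tells me some property of the i-th group; the j-th entry of array B tells me how many members are in group j. A stores floats, B stores ints. So, for definiteness, if A[5] = 100.4 and B[5] = 7, then the mass of group 5 is 100.4 and that group has 7 members.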

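My goal is to create a new float array C, of length B.sum(), which is the expansion of the above dataset. So C[0:B[0]] = A[0], C[B[0]:B[0]+B[1]] = A[1], and so on. Is there an optimized solution for this in an existing library, such as pandas?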

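My existing solution is to initialize an empty array C and then run a for loop over the elements of A, assigning to the corresponding slice of C as described above. For speed, I have been writing and compiling the for loop in cython. But this particular operation is the single biggest bottleneck in my code, and it seems like a very common array manipulation when working with tabular data, so I am wondering whether there is a heavily optimized algorithm for it.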
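For reference, the explicit loop described above can be sketched roughly as follows. This is a plain NumPy illustration, not the asker's actual Cython code; the helper name `expand_with_loop` and the sample values are made up for the example. The slice boundaries are the cumulative sums of B.

```python
import numpy as np

def expand_with_loop(A, B):
    """Repeat A[i] exactly B[i] times using explicit slice assignment."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=int)
    # Group i occupies the half-open slice C[starts[i]:ends[i]].
    ends = np.cumsum(B)
    starts = ends - B
    C = np.empty(ends[-1] if len(B) else 0, dtype=float)
    for i in range(len(A)):
        C[starts[i]:ends[i]] = A[i]
    return C

# Hypothetical sample data: two groups with 2 and 3 members.
C = expand_with_loop([100.4, 98.3], [2, 3])
print(C)
```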

Answer 1

Numpy has repeat() for this sort of thing.

Given your two arrays

import numpy as np
A = np.array([100.4,98.3,88.5])
B = np.array([7,3,10])
np.repeat(A,B)

will give you

array([ 100.4,  100.4,  100.4,  100.4,  100.4,  100.4,  100.4,   98.3,
         98.3,   98.3,   88.5,   88.5,   88.5,   88.5,   88.5,   88.5,
         88.5,   88.5,   88.5,   88.5])

Answer 2

In [58]: A = [100.4, 50.0]

In [59]: B = [7, 5]

In [60]: [A[i] for i in range(len(B)) for _ in range(B[i])]
Out[60]: [100.4, 100.4, 100.4, 100.4, 100.4, 100.4, 100.4, 50.0, 50.0, 50.0, 50.0, 50.0]
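The same comprehension can also be written with `zip`, which pairs each value with its count and avoids indexing by position. This is just an equivalent spelling of the snippet above, using the same sample data.

```python
A = [100.4, 50.0]
B = [7, 5]

# For each (value, count) pair, emit the value `count` times.
C = [a for a, n in zip(A, B) for _ in range(n)]
print(C)
```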

Answer 3

One possible way to do this is to build an iterator with itertools functions:

>>> A = np.array([100.4,98.3,88.5])
>>> B = np.array([7,3,10])
>>>
>>> from itertools import chain, izip, repeat
>>> res = chain(*(repeat(*x) for x in izip(A,B)))
>>> list(res)
[100.4, 100.4, 100.4, 100.4, 100.4, 100.4, 100.4,
 98.3, 98.3, 98.3,
 88.5, 88.5, 88.5, 88.5, 88.5, 88.5, 88.5, 88.5, 88.5, 88.5]

Update

>>> A1 = ['A', 3, [1,2]]
>>> A2 = [len, lambda x: x * 3, sum]
>>> B = [2, 3, 4]
>>>
>>> c = chain(*(repeat((a1, a2(a1)), b) for a1, a2, b in izip(A1, A2, B)))
>>> list(c)
[('A', 1), ('A', 1),
 (3, 9), (3, 9), (3, 9),
 ([1, 2], 3), ([1, 2], 3), ([1, 2], 3), ([1, 2], 3)]

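The nice thing about this solution is that you don't have to actually store all of these elements; you can just pull them from the iterator as needed.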
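The snippets in this answer are written for Python 2, where `izip` and `imap` live in itertools. On Python 3 those names no longer exist, because the built-in `zip` and `map` are already lazy. A minimal Python 3 sketch of the same idea follows; the helper name `expand_lazily` is made up for illustration, and `islice` is used to show that only the requested items are produced.

```python
from itertools import chain, islice, repeat

A = [100.4, 98.3, 88.5]
B = [7, 3, 10]

def expand_lazily(values, counts):
    """Yield each value `count` times without building the full list."""
    return chain.from_iterable(repeat(v, n) for v, n in zip(values, counts))

# Only the first eight members are ever materialized here.
first_eight = list(islice(expand_lazily(A, B), 8))
print(first_eight)
```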

You can also use imap instead of a generator expression:

>>> from itertools import chain, izip, repeat, imap
>>> A1 = ['A', 3, [1,2]]
>>> A2 = ['C', 4, 12]
>>> B = [2, 3, 4]
>>> for x in chain(*imap(repeat, izip(A1, A2), B)):
...     print x
... 
('A', 'C')
('A', 'C')
(3, 4)
(3, 4)
(3, 4)
([1, 2], 12)
([1, 2], 12)
([1, 2], 12)
([1, 2], 12)
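For readers on Python 3, a rough equivalent of the imap version uses the built-in `map` and `zip`, with `print` as a function. This is a sketch using the same sample data as above.

```python
from itertools import chain, repeat

A1 = ['A', 3, [1, 2]]
A2 = ['C', 4, 12]
B = [2, 3, 4]

# map calls repeat(pair, count) for each (A1[i], A2[i]) pair and its count.
rows = list(chain.from_iterable(map(repeat, zip(A1, A2), B)))
for row in rows:
    print(row)
```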

Answer 4

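OK, thanks again to everyone for chiming in; this has been a very useful and informative thread for my work. I am back from holiday and will now post my test results as requested. Please speak up if I have not coded any of the proposed solutions optimally.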

First, here is my fake data, written verbosely for the sake of clarity (tips on making the multi-line formatting cleaner are welcome):

Ngrps=int(1.e6)
grp_prop1=np.random.random(Ngrps)
grp_prop2=np.random.random(Ngrps)
grp_prop3=np.random.random(Ngrps)
grp_prop4=np.random.random(Ngrps)
grp_prop5=np.random.random(Ngrps)
grp_prop6=np.random.random(Ngrps)
grp_occupation=np.random.random_integers(0,5,size=Ngrps)

Now let's start with the fastest algorithm I found, the numpy solution suggested by Bob Haffner, which takes 0.15 seconds on my laptop:

mmbr_prop1=np.repeat(grp_prop1, grp_occupation)
mmbr_prop2=np.repeat(grp_prop2, grp_occupation)
mmbr_prop3=np.repeat(grp_prop3, grp_occupation)
mmbr_prop4=np.repeat(grp_prop4, grp_occupation)
mmbr_prop5=np.repeat(grp_prop5, grp_occupation)
mmbr_prop6=np.repeat(grp_prop6, grp_occupation)
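If the properties are kept as rows of a single 2-D array, the six calls can be collapsed into one `np.repeat` along `axis=1`. The sketch below is not part of the timings in this answer; it uses a smaller, made-up dataset and the names `props` and `occupation` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups = 1000  # much smaller than the benchmark above, for illustration

props = rng.random((6, n_groups))               # row k holds the k-th group property
occupation = rng.integers(0, 6, size=n_groups)  # 0 to 5 members per group

# Repeat column i of every row occupation[i] times.
mmbr_props = np.repeat(props, occupation, axis=1)
print(mmbr_props.shape)
```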

The next fastest, at 1.21 seconds, is a zipped list comprehension, suggested by inspectorG4dget:

zipped_grps = zip(grp_prop1, grp_prop2, grp_prop3, grp_prop4, grp_prop5, grp_prop6)
zipped_mmbr_props = [zipped_grps[i] for i in range(len(grp_occupation)) for _ in range(grp_occupation[i])]

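Zipping the groups by itself accounts for more than a 2x speedup. When I do not zip the group data, the list comprehension solution takes 2.71 seconds: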

z=[(grp_prop1[i], grp_prop2[i], grp_prop3[i], grp_prop4[i], grp_prop5[i], grp_prop6[i]) for i in range(len(grp_occupation)) for _ in range(grp_occupation[i])]

The itertools solution suggested by Roman Pekar takes 2.4 seconds:

zipped_grps = izip(grp_prop1, grp_prop2, grp_prop3, grp_prop4, grp_prop5, grp_prop6, grp_occupation)
c = chain(*(repeat((p1, p2, p3, p4, p5, p6), n) for p1, p2, p3, p4, p5, p6, n in zipped_grps))

Finally, the for loop I originally wrote takes 4.8 seconds:

Ntot_mbrs = grp_occupation.sum()
data=np.zeros(Ntot_mbrs*6).reshape(6, Ntot_mbrs)
first_index=0
for i in range(len(grp_occupation)):
    data[0][first_index:first_index+grp_occupation[i]] = grp_prop1[i]
    data[1][first_index:first_index+grp_occupation[i]] = grp_prop2[i]
    data[2][first_index:first_index+grp_occupation[i]] = grp_prop3[i]
    data[3][first_index:first_index+grp_occupation[i]] = grp_prop4[i]
    data[4][first_index:first_index+grp_occupation[i]] = grp_prop5[i]
    data[5][first_index:first_index+grp_occupation[i]] = grp_prop6[i]
    first_index += grp_occupation[i]

So, thanks to the suggestions made in this thread, I have sped up my code by more than a factor of 30. Many thanks, everyone!
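To reproduce this kind of comparison on your own machine, something like the following `timeit` harness could be used. It is a sketch on a smaller, single-property dataset (Python 3, with made-up function names), so the absolute numbers will not match the timings quoted above.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n_groups = 10_000
grp_prop = rng.random(n_groups)
grp_occupation = rng.integers(0, 6, size=n_groups)  # 0 to 5 members per group

def with_repeat():
    return np.repeat(grp_prop, grp_occupation)

def with_comprehension():
    return [p for p, n in zip(grp_prop, grp_occupation) for _ in range(n)]

# Check that both approaches produce the same expansion before timing them.
assert np.array_equal(with_repeat(), np.array(with_comprehension()))

for fn in (with_repeat, with_comprehension):
    # Best of 3 runs, 5 calls each, reported per call.
    best = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f"{fn.__name__}: {best:.6f} s per call")
```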

Efficient algorithm for expanding grouped tabular data
