Question

我有一个（长）列表，其中0和1随机出现：

list_a = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

我想获得list_b

列表的总和，直到0出现

如果出现0，则在列表中保留0

list_b = [1, 2, 3, 0, 1, 2, 0, 1, 0, 1, 2, 3]

我可以按如下方式实现：

list_b = []
for i, x in enumerate(list_a):
    if x == 0:
        list_b.append(x)
    else:
        sum_value = 0
        for j in list_a[i::-1]:
            if j != 0:
                sum_value += j
            else:
                break
        list_b.append(sum_value)
print(list_b)

但实际列表的长度非常长。

所以，我想改进高速代码。（如果不可读）

我改变了这样的代码：

from itertools import takewhile
list_c = [sum(takewhile(lambda x: x != 0, list_a[i::-1])) for i, d in enumerate(list_a)]
print(list_c)

但它还不够快。我怎样才能更有效地做到这一点？

Answer 1

你正在过度思考这个问题。

选项1
您可以根据当前值是否为0来迭代索引并相应地更新（计算累积和）。

data = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

for i in range(1, len(data)):
    if data[i]:  
        data[i] += data[i - 1]

也就是说，如果当前元素不为零，则将当前索引处的元素更新为当前值的总和加上前一个索引处的值。

print(data)
[1, 2, 3, 0, 1, 2, 0, 1, 0, 1, 2, 3]

请注意，这会更新您的列表。如果您不希望这样做，可以提前制作副本 - new_data = data.copy()并以相同方式迭代new_data。

选项2
如果需要性能，可以使用pandas API。根据{{1}}的展示位置查找论坛，并使用0 + groupby计算分组累积总和，类似于上述内容：

cumsum

import pandas as pd

s = pd.Series(data)    
data = s.groupby(s.eq(0).cumsum()).cumsum().tolist()

<强>性能

首先，设置 -

print(data)
[1, 2, 3, 0, 1, 2, 0, 1, 0, 1, 2, 3]

接下来，

data = data * 100000
s = pd.Series(data)

然后，单独计时，

%%timeit
new_data = data.copy()
for i in range(1, len(data)):
    if new_data[i]:  
        new_data[i] += new_data[i - 1]

328 ms ± 4.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

因此，副本并不需要花费太多时间。最后，

%timeit data.copy()
8.49 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

pandas方法在概念上是线性的（就像其他方法一样）但由于库的实现而在一定程度上更快。

Answer 2

如果你想要一个可能是内存效率最高的紧凑型原生Python解决方案，虽然不是最快的（参见评论），你可以从itertools广泛绘制：

>>> from itertools import groupby, accumulate, chain
>>> list(chain.from_iterable(accumulate(g) for _, g in groupby(list_a, bool)))
[1, 2, 3, 0, 1, 2, 0, 1, 0, 1, 2, 3]

这里的步骤是：根据0的存在（将其归类）将列表分组到子列表中，获取每个子列表中值的累积总和，展平子列表。

作为Stefan Pochmann条评论，如果您的列表内容为二进制（仅包含1和0），那么您不需要传递密钥至groupby()，它将回归身份功能。在这种情况下，这比使用bool快〜30％：

>>> list(chain.from_iterable(accumulate(g) for _, g in groupby(list_a)))
[1, 2, 3, 0, 1, 2, 0, 1, 0, 1, 2, 3]

Answer 3

我个人更喜欢这样的简单发电机：

def gen(lst):
    cumulative = 0
    for item in lst:
        if item:
            cumulative += item
        else:
            cumulative = 0
        yield cumulative

没有什么神奇的（当你知道yield如何工作时），易于阅读并且应该相当快。

如果你需要更多性能，你甚至可以将其包装为Cython扩展类型（我在这里使用IPython）。因此，您将失去“易于理解”的部分，并且需要“严重依赖”：

%load_ext cython

%%cython

cdef class Cumulative(object):
    cdef object it
    cdef object cumulative
    def __init__(self, it):
        self.it = iter(it)
        self.cumulative = 0

    def __iter__(self):
        return self

    def __next__(self):
        cdef object nxt = next(self.it)
        if nxt:
            self.cumulative += nxt
        else:
            self.cumulative = 0
        return self.cumulative

两者都需要消耗，例如使用list来提供所需的输出：

>>> list_a = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
>>> list(gen(list_a))
[1, 2, 3, 0, 1, 2, 0, 1, 0, 1, 2, 3]
>>> list(Cumulative(list_a))
[1, 2, 3, 0, 1, 2, 0, 1, 0, 1, 2, 3]

然而，既然你问速度我想分享我的时间结果：

import pandas as pd
import numpy as np
import random

import pandas as pd
from itertools import takewhile
from itertools import groupby, accumulate, chain

def MSeifert(lst):
    return list(MSeifert_inner(lst))

def MSeifert_inner(lst):
    cumulative = 0
    for item in lst:
        if item:
            cumulative += item
        else:
            cumulative = 0
        yield cumulative

def MSeifert2(lst):
    return list(Cumulative(lst))

def original1(list_a):
    list_b = []
    for i, x in enumerate(list_a):
        if x == 0:
            list_b.append(x)
        else:
            sum_value = 0
            for j in list_a[i::-1]:
                if j != 0:
                    sum_value += j
                else:
                    break
            list_b.append(sum_value)

def original2(list_a):
    return [sum(takewhile(lambda x: x != 0, list_a[i::-1])) for i, d in enumerate(list_a)]

def Coldspeed1(data):
    data = data.copy()
    for i in range(1, len(data)):
        if data[i]:  
            data[i] += data[i - 1] 
    return data

def Coldspeed2(data):
    s = pd.Series(data)    
    return s.groupby(s.eq(0).cumsum()).cumsum().tolist()

def Chris_Rands(list_a):
    return list(chain.from_iterable(accumulate(g) for _, g in groupby(list_a, bool)))

def EvKounis(list_a):
    cum_sum = 0
    list_b = []
    for item in list_a:
        if not item:            # if our item is 0
            cum_sum = 0         # the cumulative sum is reset (set back to 0)
        else:
            cum_sum += item     # otherwise it sums further
        list_b.append(cum_sum)  # and no matter what it gets appended to the result

def schumich(list_a):
    list_b = []
    s = 0
    for a in list_a:
        s = a+s if a !=0 else 0
        list_b.append(s)
    return list_b

def jbch(seq):
    return list(jbch_inner(seq))

def jbch_inner(seq):
    s = 0
    for n in seq:
        s = 0 if n == 0 else s + n
        yield s


# Timing setup
timings = {MSeifert: [], 
           MSeifert2: [],
           original1: [], 
           original2: [],
           Coldspeed1: [],
           Coldspeed2: [],
           Chris_Rands: [],
           EvKounis: [],
           schumich: [],
           jbch: []}
sizes = [2**i for i in range(1, 20, 2)]

# Timing
for size in sizes:
    print(size)
    func_input = [int(random.random() < 0.75) for _ in range(size)]
    for func in timings:
        if size > 10000 and (func is original1 or func is original2):
            continue
        res = %timeit -o func(func_input)   # if you use IPython, otherwise use the "timeit" module
        timings[func].append(res)


%matplotlib notebook

import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(1)
ax = plt.subplot(111)

baseline = MSeifert2 # choose one function as baseline
for func in timings:
    ax.plot(sizes[:len(timings[func])], 
            [time.best / ref.best for time, ref in zip(timings[func], timings[baseline])], 
            label=func.__name__)  # you could also use "func.__name__" here instead
ax.set_ylim(0.8, 1e4)
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time relative to {}'.format(baseline)) # you could also use "func.__name__" here instead
ax.grid(which='both')
ax.legend()
plt.tight_layout()

如果您对确切结果感兴趣，我会将其放入this gist。

这是一个对数日志图并且与Cython答案相关。简而言之：越快越快，两个主要滴答之间的范围代表一个数量级。

所以除了你所拥有的解决方案之外，所有解决方案往往都在一个数量级内（至少在列表很大的时候）。奇怪的是，与纯Python方法相比，大熊猫解决方案相当慢。然而，Cython解决方案将所有其他方法击败了2倍。

Answer 4

在你不需要的时候，你发布的代码中的索引太多了。您可以跟踪累计金额，并在每次遇到0时将其重置为0。

list_a = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

cum_sum = 0
list_b = []
for item in list_a:
    if not item:            # if our item is 0
        cum_sum = 0         # the cumulative sum is reset (set back to 0)
    else:
        cum_sum += item     # otherwise it sums further
    list_b.append(cum_sum)  # and no matter what it gets appended to the result
print(list_b)  # -> [1, 2, 3, 0, 1, 2, 0, 1, 0, 1, 2, 3]

Answer 5

它不必像提出的问题那样复杂，一个非常简单的方法就是这样。

list_a = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
list_b = []
s = 0
for a in list_a:
    s = a+s if a !=0 else 0
    list_b.append(s)

print list_b

Answer 6

如果你想要表演，我会使用发电机（而且它也很简单）。

def weird_cumulative_sum(seq):
    s = 0
    for n in seq:
        s = 0 if n == 0 else s + n
        yield s

list_b = list(weird_cumulative_sum(list_a_))

我认为你不会比这更好，无论如何你必须至少迭代一次list_a。

请注意，我在结果上调用了list（）以获取类似代码的列表，但是如果使用list_b的代码只使用for循环迭代它一次，那么将结果转换为列表是没有用的，把它传给发电机。

Answer 7

从Python 3.8开始并引入assignment expressions (PEP 572)（:=运算符），我们可以在列表推导中使用和增加变量：

# items = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
total = 0
[total := (total + x if x else x) for x in items]
# [1, 2, 3, 0, 1, 2, 0, 1, 0, 1, 2, 3]

此：

将变量total初始化为0，这表示运行总和
对于每个项目，这两个都：
- 通过分配表达式使当前循环的项目（total）递增total := total + x，或者如果项目为{{1，则将其重新设置为0 }}
- 同时将0映射到x的新值

计算列表的累积总和，直到出现零

7 个答案: