
Question

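I have a pandas DataFrame. I want to loop over its rows and compute a metric from the first row to the second; if the condition is not met there, from the first row to the third, then the fourth, and so on, each time comparing the metric with another value. I want to get the row number at which the condition is first met. As a concrete example, for a DataFrame of length 30 the pieces might be df.iloc[0:10], df.iloc[10:15], df.iloc[15:27] and df.iloc[27:30], where the values 10, 15, 27 are stored in a list.

Example DataFrame: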

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,100, size=(100, 1)), columns=list('A'))
df  
    A
0   5
1  11
2   8
3   1
4  16
5  24

some_value = 20 
mylist = []
for i in range(len(df)):
    for j in range(i+2, len(df)):
        # Metric calculated on the relevant rows
        metric = df.iloc[i:j]['A'].sum()
        if metric >= some_value:
           mylist.append(j)
           break

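The loop starts with df.iloc[0:2] and computes 5 + 11; since that is not greater than some_value (20), it moves on to df.iloc[0:3]. This time, since 5 + 11 + 8 is greater than some_value, I want to save this number (2) without checking df.iloc[0:4]. The loop should then start checking again from df.iloc[3:5] (1 + 16); since the condition is not met, it continues with df.iloc[3:6] (1 + 16 + 24), and so on, saving each point at which the condition is met.

The example output in this case is a list with the values: [2, 5]

I wrote the code above, but it does not quite do what I want. Could you help me solve this? Thanks.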

Answer 1

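Currently, your loop is O(n^2). But once you find a match for a starting value i, your outer loop just carries on from i + 1, and you don't want to start there. You want to start from j. Here is a quick fix to your code.

I don't have numpy at the moment, so I'm using a python list as the data.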

data = [5, 11, 8, 1, 16, 24]
some_value = 20 
mylist = []
j = 0
for i in range(len(data)):
    # can't change iteration so just skip ahead with continue
    if i < j:
        continue
    # range expects second argument to be past the end
    # dunno if df is the same, but probably?
    for j in range(i+1, len(data)+1):
        metric = sum(data[i:j])
        if metric >= some_value:
            mylist.append(j-1)
            break
print(mylist)

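[2, 5]

I would suggest doing this in a single loop and keeping a running total (an accumulator). Here I got a bit fancy and return ranges, in case you want to slice the df: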

data = [5, 11, 8, 1, 16, 24]
threshold = 20

def accumulate_to_threshold(data, threshold):
    start = 0
    total = 0
    for index, item in enumerate(data):
        total += item
        if total > threshold:
            yield (start, index+1)
            total = 0
            start = index+1
    # leftovers below threshold here

for start, end in accumulate_to_threshold(data, threshold):
    sublist = data[start:end]
    print (sublist, "totals to", sum(sublist))

[5, 11, 8] totals to 24
[1, 16, 24] totals to 41

Of course, instead of yielding ranges, you could yield the index and get [2, 5] as above.
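The index-yielding variant mentioned above could look like the following. This is a minimal sketch, not code from the answer: the name `break_indices` is made up here, and it keeps the generator's strict `>` comparison.

```python
def break_indices(data, threshold):
    """Yield the index at which each running total first exceeds threshold."""
    total = 0
    for index, item in enumerate(data):
        total += item
        if total > threshold:
            yield index
            total = 0  # start a fresh running total after each hit


data = [5, 11, 8, 1, 16, 24]
print(list(break_indices(data, 20)))  # [2, 5]
```

Each row is visited once, so the work stays linear in the number of rows.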

Answer 2

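My approach would be:

numpy.reshape(values, newshape, ...)
.sum(axis=1)
boolean mask

I don't know whether this answers your question the way you want, but I'll show how I would approach the problem with the built-in pandas / numpy vectorized methods. In short, loops are a pain (slow) and should be avoided wherever possible:

import pandas as pd
import numpy as np

# made it smaller
df = pd.DataFrame(np.random.randint(0,25, size=(20, 1)), columns=list('A'))

numpy.reshape() and sum()

We reshape the A column, which moves the values side by side, and then sum across axis=1.

Compare re_shaped below with df. Note how the values have been rearranged:
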
re_shaped = np.reshape(df.A.values, (10, 2))
print(df)

     A
0    5
1   11
2    8
3   23
...
16   6
17  14
18   3
19   0

print(re_shaped)

array([[ 5, 11],
       [ 8, 23],
       ...
       [ 6, 14],
       [ 3,  0]])

summed = re_shaped.sum(axis=1)
print(summed)

array([16, 31, 15, 19, 13, 21, 28, 30, 20,  3])
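The boolean mask step named in the approach is not shown in the answer. One plausible reading, offered here only as an assumption, is to compare the pair sums with the question's `some_value` and read off the positions that pass. The sketch reuses the `summed` values printed above; `mask` and `hits` are illustrative names.

```python
import numpy as np

# Pair sums copied from the printed output above.
summed = np.array([16, 31, 15, 19, 13, 21, 28, 30, 20, 3])
some_value = 20  # threshold from the question

# Element-wise comparison gives one True/False per pair of rows.
mask = summed >= some_value

# Positions (pair numbers, not row numbers) where the condition holds.
hits = np.flatnonzero(mask)
print(hits.tolist())  # [1, 5, 6, 7, 8]
```

Note that this tests fixed-size pairs of rows, which is not the same thing as the variable-length windows described in the question.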

There you go. Hope that helps.

Answer 3

Have you considered using just one loop:

import pandas as pd
import numpy as np

n = int(1e6)
df = pd.DataFrame({"A": np.random.randint(100, size=n)})

threshold = 20
my_list = []
s = 0
for i, k in enumerate(df["A"].values):
    if s + k > threshold:
        my_list.append(i)
        s = 0
    else:
        s += k

You could eventually use numba, but I think the best approach would be to find a way to compute a sum with reset directly in the df.
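One way a sum with reset might be computed without touching every row in Python is to take a cumulative sum once and jump to each reset point with a binary search. This is a sketch under assumptions, not the answer's method: it requires non-negative values and a non-negative threshold (so the cumulative sum never decreases), and `reset_points` is a made-up name.

```python
import numpy as np


def reset_points(values, threshold=20):
    """Indices where the running sum since the last reset first exceeds threshold.

    Assumes values >= 0 and threshold >= 0, so the cumulative sum is sorted.
    """
    csum = np.cumsum(values)
    breaks = []
    base = 0  # cumulative total at the most recent reset
    while True:
        # First position whose cumulative sum is strictly above base + threshold.
        idx = int(np.searchsorted(csum, base + threshold, side="right"))
        if idx >= len(csum):
            break  # leftover rows never exceed the threshold
        breaks.append(idx)
        base = csum[idx]
    return breaks


print(reset_points(np.array([5, 11, 8, 1, 16, 24])))  # [2, 5]
```

The Python-level loop runs once per reset point rather than once per row, so it helps most when the segments are long.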

Numba

The previous loop can be written as a function

def fun(vec, threshold=20):
    my_list = []
    s = 0
    for i, k in enumerate(vec):
        if s + k > threshold:
            my_list.append(i)
            s = 0
        else:
            s += k
    return my_list

and we can compile it with numba

from numba import jit

@jit(nopython=True, cache=True, nogil=True)
def fun_numba(vec, threshold=20):
    my_list = []
    s = 0
    for i, k in enumerate(vec):
        if s + k > threshold:
            my_list.append(i)
            s = 0
        else:
            s += k
    return my_list

%%timeit -n 5 -r 5
my_list = fun(df["A"].values)

606 ms ± 28 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

%%timeit -n 5 -r 5
my_list = fun_numba(df["A"].values)

59.6 ms ± 20.4 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

That is roughly a 10x speedup.
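One caveat when reading timings like these: a jitted function is compiled on its first call, so a measurement that includes that call can understate the speedup. The following is a minimal sketch of timing after a warm-up call. It assumes numba may or may not be installed (it falls back to the plain function), and `fun_fast`, the array size and the repeat count are chosen here only for illustration.

```python
import timeit

import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    njit = None


def fun(vec, threshold=20):
    my_list = []
    s = 0
    for i, k in enumerate(vec):
        if s + k > threshold:
            my_list.append(i)
            s = 0
        else:
            s += k
    return my_list


# Use the compiled version when numba is available, else the plain function.
fun_fast = njit(fun) if njit is not None else fun

vec = np.random.randint(100, size=100_000)

fun_fast(vec)  # warm-up call: triggers compilation, excluded from timing

# Both versions should agree before their timings are compared.
assert list(fun_fast(vec)) == fun(vec)

per_call = timeit.timeit(lambda: fun_fast(vec), number=5) / 5
print(f"{per_call * 1e3:.2f} ms per call after warm-up")
```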

Python: loop over DataFrame rows until a condition is first met

3 answers:
