I have the following pandas DataFrame:
Code   Sum  Quantity
0      -12         0
1       23         0
2      -10         0
3      -12         0
4      100         0
5      102       201
6       34         0
7      -34         0
8      -23         0
9      100         0
10     100         0
11     102       300
12     -23         0
13     -25         0
14     100       123
15     167       167
The DataFrame I want is:
Code   Sum  Quantity  new_sum
0      -12         0      -12
1       23         0       23
2      -10         0      -10
3      -12         0      -12
4      100         0        0
5      102       201      202
6       34         0       34
7      -34         0      -34
8      -23         0      -23
9      100         0        0
10     100         0        0
11     102       300      302
12     -23         0      -23
13     -25         0      -25
14     100       123      100
15     167       167      167
The logic is:
First, I look for non-zero values in the Quantity column. In the sample data above, the first non-zero Quantity is 201, at index 5. From there I want to keep adding up the Sum column, moving upwards row by row, until I reach a row with a negative value: each absorbed row's new_sum becomes 0, and the total goes into the row where the non-zero Quantity occurred.
I have written code that uses nested if statements, but because of the many if checks and row-by-row comparisons it takes a long time to run:
current_stock = 0
for i in range(len(test)):
    if test['Quantity'][i] != 0:
        current_stock = test['Sum'][i]
        if test['Sum'][i-1] > 0:
            current_stock = current_stock + test['Sum'][i-1]
            test['new_sum'][i-1] = 0
            if test['Sum'][i-2] > 0:
                current_stock = current_stock + test['Sum'][i-2]
                test['new_sum'][i-2] = 0
                if test['Sum'][i-3] > 0:
                    current_stock = current_stock + test['Sum'][i-3]
                    test['new_sum'][i-3] = 0
                else:
                    test['new_sum'][i] = current_stock
            else:
                test['new_sum'][i] = current_stock
        else:
            test['new_sum'][i] = current_stock
    else:
        test['new_sum'][i] = test['Sum'][i]
Is there a better way to do this?
Answer 0 (score: 2)
Let's look at three solutions, with a performance comparison at the end.
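None of the snippets below construct df, so for reference it can be built like this (my sketch, not part of the original answer; I am assuming the question's Code column is simply the default integer row index):

```python
import pandas as pd

# Sample data from the question; "Code" is taken to be the row index.
df = pd.DataFrame({
    'Sum': [-12, 23, -10, -12, 100, 102, 34, -34, -23,
            100, 100, 102, -23, -25, 100, 167],
    'Quantity': [0, 0, 0, 0, 0, 201, 0, 0, 0,
                 0, 0, 300, 0, 0, 123, 167],
})
```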
One approach that tries to stay close to pandas is:
def f1(df):
    # Group together the elements of df.Sum that might have to be added
    pos_groups = (df.Sum <= 0).cumsum()
    pos_groups[df.Sum <= 0] = -1
    # Create the new column and populate it with what is in df.Sum
    df['new_sum'] = df.Sum
    # Find the indices of the new column that need to be calculated as a sum
    indices = df[df.Quantity > 0].index
    for i in indices:
        # Find the relevant group of positive integers to be summed, ensuring
        # that we only consider those that come /before/ the one to be calculated
        group = pos_groups[:i+1] == pos_groups[i]
        # Zero out all the elements that will be part of the sum
        df.new_sum[:i+1][group] = 0
        # Calculate the actual sum and store that
        df.new_sum[i] = df.Sum[:i+1][group].sum()

f1(df)
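A side note of mine, not from the original answer: the chained assignments df.new_sum[:i+1][group] = 0 and df.new_sum[i] = ... can raise SettingWithCopyWarning and may not modify df at all under copy-on-write in recent pandas versions. A .loc-based sketch of the same algorithm (f1_loc is my name for it):

```python
import pandas as pd

def f1_loc(df):
    # Same algorithm as f1, but writing through .loc instead of
    # chained indexing, so it also works under copy-on-write.
    pos_groups = (df.Sum <= 0).cumsum()
    pos_groups[df.Sum <= 0] = -1
    df['new_sum'] = df.Sum
    for i in df.index[df.Quantity > 0]:
        # Labels of the positive run ending at i (.loc slices are inclusive)
        group = pos_groups.loc[:i] == pos_groups.loc[i]
        members = group[group].index
        # Zero out the members of the run, then store their total at i
        df.loc[members, 'new_sum'] = 0
        df.loc[i, 'new_sum'] = df.Sum.loc[members].sum()

df = pd.DataFrame({
    'Sum': [-12, 23, -10, -12, 100, 102, 34, -34, -23,
            100, 100, 102, -23, -25, 100, 167],
    'Quantity': [0, 0, 0, 0, 0, 201, 0, 0, 0,
                 0, 0, 300, 0, 0, 123, 167],
})
f1_loc(df)
```

Note that this assumes the default RangeIndex, where the label slice pos_groups.loc[:i] matches the positional slice [:i+1] used in f1.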
One place with room for improvement is pos_groups[:i+1] == pos_groups[i], which checks all i+1 elements even though, depending on what your data looks like, only a small fraction of them may be relevant. In practice this may still be efficient enough. If not, you may have to iterate manually to find the groups:
import numpy as np

def f2(sums, quantities):
    new_sums = np.copy(sums)
    indices = np.where(quantities > 0)[0]
    for i in indices:
        a = i
        while sums[a] > 0:
            s = new_sums[a]
            new_sums[a] = 0
            new_sums[i] += s
            a -= 1
    return new_sums

df['new_sum'] = f2(df.Sum.values, df.Quantity.values)
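As a self-contained sanity check (my harness, not part of the original answer), f2 can be run directly on the question's sample arrays:

```python
import numpy as np

def f2(sums, quantities):
    # Copy of f2 from the answer, repeated here so the check stands alone.
    new_sums = np.copy(sums)
    indices = np.where(quantities > 0)[0]
    for i in indices:
        a = i
        while sums[a] > 0:
            s = new_sums[a]
            new_sums[a] = 0
            new_sums[i] += s
            a -= 1
    return new_sums

sums = np.array([-12, 23, -10, -12, 100, 102, 34, -34, -23,
                 100, 100, 102, -23, -25, 100, 167])
qty = np.array([0, 0, 0, 0, 0, 201, 0, 0, 0,
                0, 0, 300, 0, 0, 123, 167])
new = f2(sums, qty)
# The two large events absorb the run of positive sums above them:
# index 5 gets 102 + 100 = 202, index 11 gets 102 + 100 + 100 = 302.
```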
Finally, again depending on what your data looks like, there is a good chance the latter approach can be improved further with Numba:
from numba import jit
f3 = jit(f2)
df['new_sum'] = f3(df.Sum.values, df.Quantity.values)
For the data provided in the question (which may be too small to give a representative picture), the timings are:
In [13]: %timeit f1(df)
5.32 ms ± 77.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [14]: %timeit df['new_sum'] = f2(df.Sum.values, df.Quantity.values)
190 µs ± 5.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [18]: %timeit df['new_sum'] = f3(df.Sum.values, df.Quantity.values)
178 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Here, most of the time is spent updating the DataFrame. If the data were 1000 times larger, the Numba solution would be the clear winner:
In [28]: df_large = pd.concat([df]*1000).reset_index()
In [29]: %timeit f1(df_large)
5.82 s ± 63.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [30]: %timeit df_large['new_sum'] = f2(df_large.Sum.values, df_large.Quantity.values)
6.27 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [31]: %timeit df_large['new_sum'] = f3(df_large.Sum.values, df_large.Quantity.values)
215 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)