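I'm trying to get results faster (13 minutes for 800 rows). I asked a similar question here: pandas - iterate over rows and calculate - faster - but I wasn't able to adapt the good solutions there to my variation. The difference is that if the overlap with previous values in 'col2' reaches 'n = 3' or more, the value of 'col2' in that row is set to '0', and that affects the rows that follow.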
import pandas as pd
d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
     'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)
df["overlap_count"] = ""  # create new column
n = 3  # if x >= n, then value = 0
for row in range(len(df)):
    # count the earlier col2 values that are greater than this row's col1
    x = (df["col2"].loc[0:row-1] > (df["col1"].loc[row])).sum()
    df.loc[row, "overlap_count"] = x
    if x >= n:
        df.loc[row, "col2"] = 0
        df.loc[row, "overlap_count"] = 'x'
df
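This is the result I get: values in col2 are replaced with 0 when the count reaches 'n', and the overlap_count column is marked 'x' for the same rows.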
    col1  col2  overlap_count
0     20    39              0
1     23    32              1
2     40    42              0
3     41    50              1
4     46    63              1
5     47    67              2
6     48     0              x
7     49     0              x
8     50    68              2
9     50     0              x
10    52     0              x
11    55     0              x
12    56     0              x
13    69    71              0
14    70    66              1
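Thanks for your help and time!
Answer 0: (score: 1)
I think you can use numba to improve performance. It only works with numeric values, so use -1 instead of x and fill the new column with 0 instead of an empty string: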
df["overlap_count"] = 0  # create new column
n = 3  # if x >= n, then value = 0
a = df[['col1','col2','overlap_count']].values
from numba import njit
@njit
def custom_sum(arr, n):
    for row in range(arr.shape[0]):
        x = (arr[0:row, 1] > arr[row, 0]).sum()
        arr[row, 2] = x
        if x >= n:
            arr[row, 1] = 0
            arr[row, 2] = -1
    return arr
df1 = pd.DataFrame(custom_sum(a, n), columns=df.columns)
print(df1)
    col1  col2  overlap_count
0     20    39              0
1     23    32              1
2     40    42              0
3     41    50              1
4     46    63              1
5     47    67              2
6     48     0             -1
7     49     0             -1
8     50    68              2
9     50     0             -1
10    52     0             -1
11    55     0             -1
12    56     0             -1
13    69    71              0
14    70    66              1
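If numba is not installed, the same loop can be run as ordinary Python over a NumPy array, which is a handy way to cross-check the output above. This is only a sketch: the name custom_sum_py is made up here, and it works on a copy so the input array is left untouched (the numba function above modifies arr in place).

```python
import numpy as np

def custom_sum_py(arr, n):
    """Plain-Python version of the loop above (hypothetical helper, no numba)."""
    out = arr.copy()  # work on a copy so the caller's array is not modified
    for row in range(out.shape[0]):
        # count earlier col2 values (column 1) greater than this row's col1 (column 0)
        x = int((out[:row, 1] > out[row, 0]).sum())
        out[row, 2] = x
        if x >= n:
            out[row, 1] = 0   # a zeroed value cannot exceed any non-negative col1 later on
            out[row, 2] = -1  # numeric stand-in for 'x'
    return out

col1 = [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70]
col2 = [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]
a = np.column_stack([col1, col2, np.zeros(len(col1), dtype=np.int64)])
result = custom_sum_py(a, 3)
print(result[:, 2].tolist())
```

The printed list should match the overlap_count column shown above. Expect it to be much slower than the numba version on large inputs.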
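Performance:
d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
     'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)
# 4500 rows
df = pd.concat([df] * 300, ignore_index=True)
print(df)
# assumed setup, not shown in the original timing: rebuild the input array from the enlarged frame
df["overlap_count"] = 0
a = df[['col1','col2','overlap_count']].values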
In [115]: %%timeit
     ...: pd.DataFrame(custom_sum(a, n), columns=df.columns)
     ...:
8.11 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [116]: %%timeit
     ...: for row in range(len(df)):
     ...:     x = (df["col2"].loc[0:row-1] > (df["col1"].loc[row])).sum()
     ...:     df["overlap_count"].loc[row] = x
     ...:
     ...:     if x >= n:
     ...:         df["col2"].loc[row] = 0
     ...:         df["overlap_count"].loc[row] = 'x'
     ...:
     ...:
7.84 s ± 442 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
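Both timed versions still compare each row against every earlier row, so the work grows quadratically with the number of rows. One possible alternative, not taken from this answer, is to keep the earlier col2 values in a sorted list so that each count becomes a binary search. The sketch below uses only the standard-library bisect module; the function name overlap_counts is an assumption for illustration, and the -1 marker follows the convention used above.

```python
from bisect import bisect_right, insort

def overlap_counts(col1, col2, n):
    """Return (new_col2, counts) using a sorted list of earlier col2 values."""
    seen = []  # earlier col2 values (after zeroing), kept in sorted order
    new_col2, counts = [], []
    for c1, c2 in zip(col1, col2):
        # elements strictly greater than c1 sit to the right of bisect_right
        x = len(seen) - bisect_right(seen, c1)
        if x >= n:
            c2, x = 0, -1  # zero the value and mark the row, as in the loop above
        insort(seen, c2)   # later rows compare against the (possibly zeroed) value
        new_col2.append(c2)
        counts.append(x)
    return new_col2, counts

col1 = [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70]
col2 = [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]
new_col2, counts = overlap_counts(col1, col2, 3)
print(new_col2)
print(counts)
```

insort still shifts list elements on every insert, so this is not a strict O(n log n) bound, and whether it beats the numba loop on real data would need to be measured.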
Answer 1: (score: 0)
Create a function, then apply it as follows:
df['overlap_count'] = [fn(i) for i in df['overlap_count']]
Answer 2: (score: 0)
Try this, it may be faster.
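This answer does not say what fn should be. As a purely hypothetical illustration of the pattern, the function below (to_label, a made-up name) turns the numeric -1 marker from Answer 0 back into the 'x' used in the question. A plain list stands in for the DataFrame column here.

```python
def to_label(value):
    """Hypothetical fn: map the -1 marker back to 'x', leave other counts alone."""
    return 'x' if value == -1 else value

# stand-in for df['overlap_count'] as produced by Answer 0
overlap_count = [0, 1, 0, 1, 1, 2, -1, -1, 2, -1, -1, -1, -1, 0, 1]
labels = [to_label(i) for i in overlap_count]
print(labels)
```

A list comprehension like this only post-processes values that already exist. It does not compute the running overlap count, because each row's count depends on the rows before it.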
df['overlap_count'] = df.groupby('col1')['col2'].transform(lambda g: len((g >= g.name).index))
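Before relying on this one-liner it is worth checking what it returns. As far as I can tell, (g >= g.name) is a boolean Series with the same index as the group, so taking len of its index just gives the group size, whatever the comparison yields. The sketch below runs the expression on the question's data; nothing beyond that data and the answer's expression is assumed.

```python
import pandas as pd

d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
     'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)

# the answer's expression, kept in a separate variable so df is unchanged
sizes = df.groupby('col1')['col2'].transform(lambda g: len((g >= g.name).index))
print(sizes.tolist())
```

On this data only the two rows with col1 == 50 get 2 and every other row gets 1, which is the number of rows per col1 value and not the overlap count from the question.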