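My DataFrame has more than 500,000 rows, and I have many similar "for" loops, so my code takes over an hour to finish its calculations. Is there a more efficient way to write the following "for" loop so that it runs faster?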
col_26 = []
col_27 = []
col_28 = []
for ind in df.index:
    if df['A_factor'][ind] > df['B_factor'][ind]:
        col_26.append('Yes')
        col_27.append('No')
        col_28.append(df['A_value'][ind])
    elif df['A_factor'][ind] < df['B_factor'][ind]:
        col_26.append('No')
        col_27.append('Yes')
        col_28.append(df['B_value'][ind])
    else:
        col_26.append('')
        col_27.append('')
        col_28.append(float('nan'))
Answer 0 (score: 1)
You may want to look into the pandas iterrows() function or use apply. This article may also be worth a look: https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
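For illustration, here is a minimal sketch of the apply route mentioned above. It assumes the column names from the question; the helper classify and the small sample frame are hypothetical. Note that apply with axis=1 still calls a Python function once per row, so it is usually tidier than an index loop rather than dramatically faster.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the question's column names.
df = pd.DataFrame({
    'A_factor': [1, 2, 3, 4],
    'A_value': [10, 20, 30, 40],
    'B_factor': [2, 1, 3, 5],
    'B_value': [11, 22, 33, 44],
})

def classify(row):
    """Return (col_26, col_27, col_28) for a single row."""
    if row['A_factor'] > row['B_factor']:
        return 'Yes', 'No', row['A_value']
    if row['A_factor'] < row['B_factor']:
        return 'No', 'Yes', row['B_value']
    return '', '', np.nan

# result_type='expand' spreads each returned tuple across three columns.
result = df.apply(classify, axis=1, result_type='expand')
result.columns = ['col_26', 'col_27', 'col_28']
df = df.join(result)
```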
Answer 1 (score: 1)
Try column-wise operations:
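import numpy as np
import pandas as pd

data = {'A_factor': [1, 2, 3, 4, 5],
        'A_value': [10, 20, 30, 40, 50],
        'B_factor': [2, 3, 1, 2, 6],
        'B_value': [11, 22, 33, 44, 55]}
df = pd.DataFrame(data)

# Defaults cover the tie case (A_factor == B_factor).
df['col_26'] = ''
df['col_27'] = ''
df['col_28'] = np.nan

# Use two separate masks so that ties are not treated as "B is greater".
a_greater = df['A_factor'] > df['B_factor']
b_greater = df['A_factor'] < df['B_factor']

df.loc[a_greater, 'col_26'] = 'Yes'
df.loc[a_greater, 'col_27'] = 'No'
df.loc[a_greater, 'col_28'] = df.loc[a_greater, 'A_value']

df.loc[b_greater, 'col_26'] = 'No'
df.loc[b_greater, 'col_27'] = 'Yes'
df.loc[b_greater, 'col_28'] = df.loc[b_greater, 'B_value']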
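The same column-wise idea can also be sketched with numpy.select, which picks a value per row from a list of conditions and falls back to a default when none match. This is a variant for comparison, not the answer author's code; the sample frame is hypothetical and its last row is a deliberate tie so that the default branch is exercised.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row has A_factor == B_factor.
df = pd.DataFrame({
    'A_factor': [1, 2, 3, 4, 5],
    'A_value': [10, 20, 30, 40, 50],
    'B_factor': [2, 3, 1, 2, 5],
    'B_value': [11, 22, 33, 44, 55],
})

# Conditions are checked in order; rows matching neither get the default.
conditions = [
    df['A_factor'] > df['B_factor'],
    df['A_factor'] < df['B_factor'],
]

df['col_26'] = np.select(conditions, ['Yes', 'No'], default='')
df['col_27'] = np.select(conditions, ['No', 'Yes'], default='')
df['col_28'] = np.select(conditions, [df['A_value'], df['B_value']], default=np.nan)
```

Each column is filled in a single call, so the tie case does not need a mask of its own.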
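Answer 2 (score: 0)
Appending to a list in Python is relatively slow; initializing the list to its final size before iterating can speed things up. For example: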
import timeit

def f():
    # Grow the list one element at a time.
    x = []
    for ii in range(500000):
        x.append(str(ii))

def f2():
    # Preallocate the list, then fill it by index.
    x = [""] * 500000
    for ii in range(500000):
        x[ii] = str(ii)

timeit.timeit("f()", "from __main__ import f", number=10)
# Output: 1.6317970999989484
timeit.timeit("f2()", "from __main__ import f2", number=10)
# Output: 1.3037318000024243
Since you are already using pandas / numpy, there are several ways to vectorize your operations so that they do not need a loop at all. For example:
import numpy as np

a_factor = df["A_factor"].to_numpy()
b_factor = df["B_factor"].to_numpy()
a_value = df["A_value"].to_numpy()
b_value = df["B_value"].to_numpy()

# Start from the tie-case defaults: '' for the flags and NaN for the value.
col_26 = np.full(a_factor.shape, '', dtype='U3')  # U3 => strings of up to 3 characters
col_27 = np.full(a_factor.shape, '', dtype='U3')
col_28 = np.full(a_factor.shape, np.nan)

a_greater = a_factor > b_factor
b_greater = a_factor < b_factor

col_26[a_greater] = 'Yes'
col_26[b_greater] = 'No'
col_27[a_greater] = 'No'
col_27[b_greater] = 'Yes'
col_28[a_greater] = a_value[a_greater]
col_28[b_greater] = b_value[b_greater]
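Before swapping a loop for a vectorized version, it can help to confirm that both produce the same result on data that includes ties. The sketch below does that on random data, following the logic of the loop in the question; the function names loop_version and vectorized_version and the generated frame are hypothetical.

```python
import numpy as np
import pandas as pd

def loop_version(df):
    """Row-by-row reference, following the logic of the loop in the question."""
    col_26, col_27, col_28 = [], [], []
    for a_f, b_f, a_v, b_v in zip(df['A_factor'], df['B_factor'],
                                  df['A_value'], df['B_value']):
        if a_f > b_f:
            col_26.append('Yes'); col_27.append('No'); col_28.append(a_v)
        elif a_f < b_f:
            col_26.append('No'); col_27.append('Yes'); col_28.append(b_v)
        else:
            col_26.append(''); col_27.append(''); col_28.append(float('nan'))
    return col_26, col_27, col_28

def vectorized_version(df):
    """Boolean-mask version operating on whole NumPy arrays."""
    a_factor = df['A_factor'].to_numpy()
    b_factor = df['B_factor'].to_numpy()
    a_greater = a_factor > b_factor
    b_greater = a_factor < b_factor
    col_26 = np.full(a_factor.shape, '', dtype='U3')
    col_27 = np.full(a_factor.shape, '', dtype='U3')
    col_28 = np.full(a_factor.shape, np.nan)
    col_26[a_greater], col_26[b_greater] = 'Yes', 'No'
    col_27[a_greater], col_27[b_greater] = 'No', 'Yes'
    col_28[a_greater] = df['A_value'].to_numpy()[a_greater]
    col_28[b_greater] = df['B_value'].to_numpy()[b_greater]
    return col_26, col_27, col_28

# Random test data; small integer factors guarantee plenty of ties.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    'A_factor': rng.integers(0, 5, n),
    'B_factor': rng.integers(0, 5, n),
    'A_value': rng.random(n),
    'B_value': rng.random(n),
})

loop_26, loop_27, loop_28 = loop_version(df)
vec_26, vec_27, vec_28 = vectorized_version(df)

same_flags = vec_26.tolist() == loop_26 and vec_27.tolist() == loop_27
# equal_nan=True treats NaN entries at the same position as equal.
same_values = np.array_equal(vec_28, np.array(loop_28), equal_nan=True)
print(same_flags, same_values)
```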
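Answer 3 (score: 0)
Calling append makes Python request more memory from the heap. Using append inside a for loop causes memory to be acquired and released over and over as the list grows, so it is better to tell Python up front how many items you need.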
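n = len(df)  # preallocate to the actual row count instead of hard-coding 500000
col_26 = ['Yes'] * n
col_27 = ['No'] * n
col_28 = [float('nan')] * n
# The defaults above match the A_factor > B_factor case, so only col_28 changes there.
# enumerate gives a list position even when df.index is not 0..n-1.
for pos, ind in enumerate(df.index):
    if df['A_factor'][ind] > df['B_factor'][ind]:
        col_28[pos] = df['A_value'][ind]
    elif df['A_factor'][ind] < df['B_factor'][ind]:
        col_26[pos] = 'No'
        col_27[pos] = 'Yes'
        col_28[pos] = df['B_value'][ind]
    else:
        # Tie: blank flags; col_28 keeps its NaN default.
        col_26[pos] = ''
        col_27[pos] = ''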