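I have 12,000 CSV files, each with 6,000 rows. My code uses a for loop, which I believe is why the execution time is so high. Does anyone know how to rewrite this code with pandas to reduce the execution time?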
for i in range(len(df)):
    if ((df['EOG_Start_model'].values[i]-df['EOG_Min_model'].values[i])<(df['EOG_start_farm'].values[i]-df['EOG_Min_Farm'].values[i])) &((df['EOG_Max_model'].values[i]-df['EOG_Min_model'].values[i])<(df['EOG_Max_Farm'].values[i]-df['EOG_Min_Farm'].values[i]))&((df['Avg'].values[i]>2)):
        #print('EOG')
        df['EOG_flag'].values[i]=1
    if ((df['EOG_Max_model'].values[i]-df['EOG_Min_model'].values[i])<(df['EOG_Max_Farm'].values[i]-df['EOG_Min_Farm'].values[i]))&((df['Avg'].values[i]>2)):
        #print('gust')
        df['Gust_flag'].values[i]=1
Note: the code works correctly; it is just slow.
Answer 0: (score: 3)
You can use a vectorized solution: create each boolean mask separately, chain the masks together with &, and set the values with numpy.where:
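# farm-side differences, computed once for the whole column
x = df['EOG_start_farm'].values - df['EOG_Min_Farm'].values
y = df['EOG_Max_Farm'].values - df['EOG_Min_Farm'].values
m1 = (df['EOG_Start_model'].values - df['EOG_Min_model'].values) < x
# the max-min comparison uses EOG_Max_Farm, as in the loop
m2 = (df['EOG_Max_model'].values - df['EOG_Min_model'].values) < y
m3 = df['Avg'].values > 2
m23 = m2 & m3
# set the flag to 1 where the mask is True, otherwise keep the existing value
df['EOG_flag'] = np.where(m1 & m23, 1, df['EOG_flag'].values)
df['Gust_flag'] = np.where(m23, 1, df['Gust_flag'].values)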
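Before running a vectorized rewrite over thousands of files, it can help to check it against a row-by-row reference on a small frame. The following is a minimal sketch of such a check, not a definitive implementation: the function names `flag_loop` and `flag_vectorized` are hypothetical, the test data is synthetic, and `Avg` is drawn from 0 to 4 only so that the `Avg > 2` condition is sometimes true. The vectorized variant here uses boolean indexing with `.loc`, which for this purpose is an alternative to numpy.where.

```python
import numpy as np
import pandas as pd

COLS = ['EOG_Start_model', 'EOG_Min_model', 'EOG_start_farm', 'EOG_Min_Farm',
        'EOG_Max_model', 'EOG_Max_Farm', 'Avg', 'EOG_flag', 'Gust_flag']


def flag_loop(df):
    """Row-by-row reference that mirrors the conditions in the question."""
    out = df.copy()
    eog = out['EOG_flag'].to_numpy().copy()
    gust = out['Gust_flag'].to_numpy().copy()
    for i in range(len(out)):
        row = out.iloc[i]
        start_ok = (row['EOG_Start_model'] - row['EOG_Min_model']) < \
                   (row['EOG_start_farm'] - row['EOG_Min_Farm'])
        range_ok = (row['EOG_Max_model'] - row['EOG_Min_model']) < \
                   (row['EOG_Max_Farm'] - row['EOG_Min_Farm'])
        avg_ok = row['Avg'] > 2
        if start_ok and range_ok and avg_ok:
            eog[i] = 1
        if range_ok and avg_ok:
            gust[i] = 1
    out['EOG_flag'] = eog
    out['Gust_flag'] = gust
    return out


def flag_vectorized(df):
    """Same conditions evaluated on whole columns at once."""
    out = df.copy()
    start_ok = (out['EOG_Start_model'] - out['EOG_Min_model']) < \
               (out['EOG_start_farm'] - out['EOG_Min_Farm'])
    range_ok = (out['EOG_Max_model'] - out['EOG_Min_model']) < \
               (out['EOG_Max_Farm'] - out['EOG_Min_Farm'])
    avg_ok = out['Avg'] > 2
    # Boolean indexing: assign 1 only on rows where the mask is True.
    out.loc[start_ok & range_ok & avg_ok, 'EOG_flag'] = 1
    out.loc[range_ok & avg_ok, 'Gust_flag'] = 1
    return out


# Synthetic data (an assumption for illustration, not the asker's data).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 9)), columns=COLS)
df['Avg'] = rng.uniform(0, 4, size=len(df))
df[['EOG_flag', 'Gust_flag']] = 0

expected = flag_loop(df)
result = flag_vectorized(df)
pd.testing.assert_frame_equal(result, expected)  # raises if the two disagree
```

If the assertion passes on data that exercises every branch, the vectorized version can replace the loop with some confidence.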
Performance:
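import numpy as np
import pandas as pd

np.random.seed(2019)
N = 6000
c = ['EOG_Start_model','EOG_Min_model','EOG_start_farm','EOG_Min_Farm','EOG_Max_model',
     'EOG_Max_Farm','Avg','EOG_flag','Gust_flag']
# rand() draws from [0, 1), so Avg > 2 is never true here and no flag is set;
# the comparisons are still evaluated for every row
df = pd.DataFrame(np.random.rand(N, 9), columns=c)
print(df)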
In [91]: %%timeit
...: x = df['EOG_start_farm'].values-df['EOG_Min_Farm'].values
...: y = df['EOG_Max_Farm'].values-df['EOG_Min_Farm'].values
...: m1 = (df['EOG_Start_model'].values-df['EOG_Min_model'].values) < x
...: m2 = (df['EOG_Max_model'].values-df['EOG_Min_model'].values) < y
...: m3 = df['Avg'].values > 2
...: m23 = m2 & m3
...:
...: df['EOG_flag'] = np.where(m1 & m23, 1, df['EOG_flag'].values)
...: df['Gust_flag'] = np.where(m23, 1, df['Gust_flag'].values)
...:
597 µs ± 6.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [93]: %%timeit
...: for i in range(len(df)):
...:     if ((df['EOG_Start_model'].values[i]-df['EOG_Min_model'].values[i])<(df['EOG_start_farm'].values[i]-df['EOG_Min_Farm'].values[i])) &((df['EOG_Max_model'].values[i]-df['EOG_Min_model'].values[i])<(df['EOG_Max_Farm'].values[i]-df['EOG_Min_Farm'].values[i]))&((df['Avg'].values[i]>2)):
...:         #print('EOG')
...:         df['EOG_flag'].values[i]=1
...:
...:     if ((df['EOG_Max_model'].values[i]-df['EOG_Min_model'].values[i])<(df['EOG_Max_Farm'].values[i]-df['EOG_Min_Farm'].values[i]))&((df['Avg'].values[i]>2)):
...:         #print('gust')
...:         df['Gust_flag'].values[i]=1
231 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
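Since the question involves 12,000 files, the remaining loop is the one over files, and each file can be flagged in a single vectorized pass. The following is a rough sketch under stated assumptions: `add_flags` and `process_folder` are hypothetical helper names, the directory layout is invented for the demo, and every CSV is assumed to already contain the `EOG_flag` and `Gust_flag` columns. The demo writes three small synthetic files to a temporary directory so that it runs on its own.

```python
import glob
import os
import tempfile

import numpy as np
import pandas as pd

COLS = ['EOG_Start_model', 'EOG_Min_model', 'EOG_start_farm', 'EOG_Min_Farm',
        'EOG_Max_model', 'EOG_Max_Farm', 'Avg', 'EOG_flag', 'Gust_flag']


def add_flags(df):
    """Set both flags for one file using column-wise comparisons."""
    start_ok = (df['EOG_Start_model'] - df['EOG_Min_model']) < \
               (df['EOG_start_farm'] - df['EOG_Min_Farm'])
    range_ok = (df['EOG_Max_model'] - df['EOG_Min_model']) < \
               (df['EOG_Max_Farm'] - df['EOG_Min_Farm'])
    gust = (range_ok & (df['Avg'] > 2)).to_numpy()
    eog = start_ok.to_numpy() & gust
    # Keep the existing flag value wherever the condition is False.
    df['EOG_flag'] = np.where(eog, 1, df['EOG_flag'].to_numpy())
    df['Gust_flag'] = np.where(gust, 1, df['Gust_flag'].to_numpy())
    return df


def process_folder(in_dir, out_dir):
    """Read every CSV in in_dir, flag it, and write it to out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    paths = sorted(glob.glob(os.path.join(in_dir, '*.csv')))
    for path in paths:
        df = add_flags(pd.read_csv(path))
        df.to_csv(os.path.join(out_dir, os.path.basename(path)), index=False)
    return len(paths)


# Demo on synthetic files (hypothetical names and data).
with tempfile.TemporaryDirectory() as tmp:
    in_dir = os.path.join(tmp, 'in')
    out_dir = os.path.join(tmp, 'out')
    os.makedirs(in_dir)
    rng = np.random.default_rng(1)
    for k in range(3):
        demo = pd.DataFrame(rng.uniform(0, 4, size=(50, 9)), columns=COLS)
        demo[['EOG_flag', 'Gust_flag']] = 0
        demo.to_csv(os.path.join(in_dir, f'file_{k}.csv'), index=False)

    n_files = process_folder(in_dir, out_dir)
    flagged = pd.read_csv(os.path.join(out_dir, 'file_0.csv'))
```

With the per-file work vectorized, reading and writing the CSV files is likely to account for most of the remaining time, so it is worth timing a few real files before optimizing further.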