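Faster alternative to a for loop over a DataFrame?

Time: 2019-04-06 06:39:24

Tags: python pandas

I have a DataFrame df with 10 million rows. I am running the loop below, which takes a very long time to execute. Is there a faster way to accomplish the same task?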

for i in range(len(df)):
    if df['col_1'][i] not in ['a', 'b']:
        df.at[i,'col_1'] = np.nan

2 answers:

Answer 0 (score: 2)

Try this:

df.loc[~df['col_1'].isin(['a', 'b']), 'col_1'] = np.nan
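
To make the one-liner easier to follow, here is a minimal sketch on a small hypothetical frame (the sample values are an assumption, not the asker's data). isin builds a boolean mask, ~ inverts it, and .loc assigns NaN only to the rows where the inverted mask is True.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the 10-million-row frame.
df = pd.DataFrame({'col_1': ['a', 'c', 'b', 'd', 'a']})

# True where the value is NOT one of the allowed values.
mask = ~df['col_1'].isin(['a', 'b'])

# Assign NaN to the masked rows in one vectorized step, with no Python-level loop.
df.loc[mask, 'col_1'] = np.nan

result = df['col_1'].tolist()
```

The mask is computed once for the whole column, which is where the speedup over the row-by-row loop comes from.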

Answer 1 (score: 1)

For better performance, use numpy.where, and convert the Series to a 1-D array with Series.values:

df['col_1'] = np.where(df['col_1'].isin(['a', 'b']),  
                       df['col_1'].values, 
                       np.nan)

#pandas 0.24+
df['col_1'] = np.where(df['col_1'].isin(['a', 'b']),  
                       df['col_1'].to_numpy(), 
                       np.nan)
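
As a small illustration of the same idea, the sketch below wraps the call in a helper and runs it on a hypothetical five-row frame. The name keep_allowed and the sample values are assumptions made for this example; it relies on to_numpy, so it needs pandas 0.24 or later.

```python
import numpy as np
import pandas as pd

def keep_allowed(series, allowed):
    """Hypothetical helper: replace values outside `allowed` with NaN."""
    # np.where picks the original value where the condition is True
    # and np.nan everywhere else.
    return np.where(series.isin(allowed), series.to_numpy(), np.nan)

# Hypothetical sample data.
df = pd.DataFrame({'col_1': ['a', 'c', 'b', 'd', 'a']})
df['col_1'] = keep_allowed(df['col_1'], ['a', 'b'])

result = df['col_1'].tolist()
```

Unlike the .loc assignment, np.where builds a complete new array, which is then assigned back to the column.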

Test with 10% of the values being a or b (p = .05 + .05):

np.random.seed(2019)
N = 10 ** 7
df = pd.DataFrame({'col_1':np.random.choice(['a','b','c'], p=(.05,.05,.9),size=N)})
#print (df)

In [87]: %timeit df['col_1'] = np.where(df['col_1'].isin(['a', 'b']),  df['col_1'].values, np.nan)
425 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [88]: %timeit df['col_1'] = np.where(df['col_1'].isin(['a', 'b']),  df['col_1'], np.nan)
442 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [89]: %timeit df.loc[~df['col_1'].isin(['a', 'b']), 'col_1'] = np.nan
537 ms ± 4.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Test with 50% of the values being a or b:

np.random.seed(2019)
N = 10 ** 7
df = pd.DataFrame({'col_1':np.random.choice(['a','b','c'], p=(.25,.25,.5),size=N)})
print (df)

In [101]: %timeit df['col_1'] = np.where(df['col_1'].isin(['a', 'b']),  df['col_1'].values, np.nan)
532 ms ± 3.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [102]: %timeit df['col_1'] = np.where(df['col_1'].isin(['a', 'b']),  df['col_1'], np.nan)
533 ms ± 4.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [103]: %timeit df.loc[~df['col_1'].isin(['a', 'b']), 'col_1'] = np.nan
602 ms ± 2.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Test with 90% of the values being a or b:

np.random.seed(2019)
N = 10 ** 7
df = pd.DataFrame({'col_1':np.random.choice(['a','b','c'], p=(.45,.45,.1),size=N)})
print (df)


In [106]: %timeit df['col_1'] = np.where(df['col_1'].isin(['a', 'b']),  df['col_1'].values, np.nan)
517 ms ± 3.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [107]: %timeit df['col_1'] = np.where(df['col_1'].isin(['a', 'b']),  df['col_1'], np.nan)
520 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [108]: %timeit df.loc[~df['col_1'].isin(['a', 'b']), 'col_1'] = np.nan
557 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
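
The %timeit lines above only work inside IPython. A rough sketch of a similar comparison as a plain script, using the standard timeit module, could look like the following. The function names via_where and via_loc, the smaller N, and the number of repetitions are assumptions chosen so the script finishes quickly. Each call works on a fresh copy of the frame, so the copy cost is included and the numbers will not match the session above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(2019)
N = 10 ** 5  # assumption: smaller than the 10 ** 7 used above
base = pd.DataFrame({'col_1': np.random.choice(['a', 'b', 'c'],
                                               p=(.25, .25, .5), size=N)})

def via_where(df):
    # np.where approach from answer 1, applied to a copy.
    out = df.copy()
    out['col_1'] = np.where(out['col_1'].isin(['a', 'b']),
                            out['col_1'].to_numpy(), np.nan)
    return out

def via_loc(df):
    # Boolean-mask approach from answer 0, applied to a copy.
    out = df.copy()
    out.loc[~out['col_1'].isin(['a', 'b']), 'col_1'] = np.nan
    return out

# Sanity check: both approaches should blank out the same rows.
same_nulls = via_where(base)['col_1'].isna().equals(via_loc(base)['col_1'].isna())

for name, func in [('np.where', via_where), ('loc', via_loc)]:
    seconds = timeit.timeit(lambda: func(base), number=3) / 3
    print(f'{name}: {seconds * 1000:.1f} ms per call')
```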