Split a DataFrame into 2 in a single scan

Time: 2019-01-19 12:54:48

Tags: python pandas

Say we have the following DataFrame:

import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4], 'B': [1,4,9,16]})

If I have to split it into 2 DataFrames, I can do it like this:

df1 = df[df['B'] < 5]
df2 = df[df['B'] >= 5]

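But here df is scanned twice. Is there a way to split the DataFrame into 2 in a single efficient line, with only one scan?

Edit: Even @jezrael's suggestion has similar performance: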

m = df['B'] < 5
# comparing the underlying NumPy array can be slightly faster
# m = df['B'].values < 5
df1 = df[m]
df2 = df[~m]  

2 answers:

Answer 0: (score: 2)

Yes, compute the condition once and invert it with ~:

m = df['B'] < 5
# comparing the underlying NumPy array can be slightly faster
# m = df['B'].values < 5
df1 = df[m]
df2 = df[~m]
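
As a quick check of the idea above, here is a minimal sketch that wraps the mask-and-invert pattern in a small helper and confirms that the two pieces together cover every row. The helper name split_by_mask is hypothetical (it is not part of pandas), and the sketch assumes the mask contains no missing values.

```python
import pandas as pd


def split_by_mask(df, mask):
    """Hypothetical helper: return (rows where mask is True, rows where it is False)."""
    return df[mask], df[~mask]


df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1, 4, 9, 16]})

# The comparison runs once; ~m only flips the stored booleans.
m = df['B'] < 5
df1, df2 = split_by_mask(df, m)

print(df1)
print(df2)

# Every row ends up in exactly one of the two pieces.
assert len(df1) + len(df2) == len(df)
```

Note that this still performs two row selections; it only avoids evaluating the comparison twice.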

Performance: with 1M rows, all the mask-based variants take about the same time:

import numpy as np
import pandas as pd

np.random.seed(2019)
N = 1000000
df = pd.DataFrame({'A': np.random.randint(10, size=N),
                   'B': np.random.randint(10, size=N)})
print(df)

In [53]: %%timeit
    ...: df1 = df[df['B'] < 5]
    ...: df2 = df[df['B'] >= 5]
    ...: 
38.5 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [54]: %%timeit
    ...: m = df['B'] < 5
    ...: df1 = df[m]
    ...: df2 = df[~m]
    ...: 
37.3 ms ± 298 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [55]: %%timeit
    ...: df1 = df[df['B'].values < 5]
    ...: df2 = df[df['B'].values >= 5]
    ...: 
37.8 ms ± 374 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [56]: %%timeit
    ...: m = df['B'].values < 5
    ...: df1 = df[m]
    ...: df2 = df[~m]
    ...: 
36.8 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The solution from the other answer (about twice as slow here):

In [70]: %%timeit
    ...: sampled_dfs = [x for _, x in df.groupby(df['B']<5)]
    ...: df1 = sampled_dfs[0]
    ...: df2 = sampled_dfs[1]
    ...: 
76.9 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
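
The timings above come from IPython's %%timeit. Outside IPython, a rough equivalent can be sketched with the standard timeit module. The function names below are hypothetical, and N is reduced to 100,000 rows (an assumption, just to keep the run short), so the absolute numbers will not match the ones above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(2019)
N = 100_000  # assumption: smaller than the 1M rows used above, to keep this quick
df = pd.DataFrame({'A': np.random.randint(10, size=N),
                   'B': np.random.randint(10, size=N)})


def two_comparisons():
    # Evaluates the condition twice, as in the question.
    return df[df['B'] < 5], df[df['B'] >= 5]


def mask_and_invert():
    # Evaluates the condition once and inverts it.
    m = df['B'] < 5
    return df[m], df[~m]


def groupby_split():
    # Splits via groupby on the boolean condition.
    return [x for _, x in df.groupby(df['B'] < 5)]


timings = {}
for fn in (two_comparisons, mask_and_invert, groupby_split):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    timings[fn.__name__] = best
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per loop")
```

Results depend on the machine and the pandas version, so treat the output as a relative comparison only.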

Answer 1: (score: 2)

You can use groupby, as shown below:

import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4], 'B': [1,4,9,16]})
# groups are ordered by key: False (B >= 5) first, then True (B < 5)
sampled_dfs = [x for _, x in df.groupby(df['B']<5)]
print(sampled_dfs[0])
print(sampled_dfs[1])

Output:

   A   B
2  3   9
3  4  16

   A  B
0  1  1
1  2  4
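
One caveat with indexing the list by position: if every row falls on the same side of the condition, groupby yields only one group and sampled_dfs[1] raises an IndexError. A minimal sketch that avoids this, assuming you prefer to look the pieces up by their boolean key (the names parts and empty are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1, 4, 9, 16]})

# Key each group by the boolean value of the condition instead of by position.
parts = {bool(key): group for key, group in df.groupby(df['B'] < 5)}

# Fall back to an empty frame with the same columns if a side has no rows.
empty = df.iloc[0:0]
df1 = parts.get(True, empty)   # rows where B < 5
df2 = parts.get(False, empty)  # rows where B >= 5

print(df1)
print(df2)
```

This keeps the groupby approach but makes it explicit which piece is which.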