Say I have the following DataFrame:
df = pd.DataFrame({'A': [1,2,3,4], 'B': [1,4,9,16]})
If I need to split it into two DataFrames, I can do it like this:
df1 = df[df['B'] < 5]
df2 = df[df['B'] >= 5]
But here df is scanned twice. Is there an efficient one-liner that splits a DataFrame into two in a single scan?
Edit: even @jezrael's suggestion performs about the same:
m = df['B'] < 5
# comparing the underlying NumPy array can be slightly faster
# m = df['B'].values < 5
df1 = df[m]
df2 = df[~m]
Answer 0 (score: 2)
Yes, you need to invert the condition with ~:
m = df['B'] < 5
# comparing the underlying NumPy array can be slightly faster
# m = df['B'].values < 5
df1 = df[m]
df2 = df[~m]
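To try the mask approach end to end, here is a minimal self-contained sketch. The helper name `split_by_mask` is hypothetical (it is not a pandas function); it only wraps the two indexing steps shown above. The comparison is evaluated once, but the rows are still selected twice, once per half.

```python
import pandas as pd


def split_by_mask(df, mask):
    """Return (rows where mask is True, rows where mask is False).

    Hypothetical helper, not a pandas API: the caller computes the
    mask once and it is only inverted here.
    """
    return df[mask], df[~mask]


df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1, 4, 9, 16]})
m = df['B'] < 5  # boolean Series, evaluated a single time
df1, df2 = split_by_mask(df, m)

print(df1)
print(df2)
```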
Performance: all of these approaches are similar for 1M rows:
import numpy as np
import pandas as pd

np.random.seed(2019)
N = 1000000
df = pd.DataFrame({'A': np.random.randint(10, size=N),
                   'B': np.random.randint(10, size=N)})
print(df)
In [53]: %%timeit
    ...: df1 = df[df['B'] < 5]
    ...: df2 = df[df['B'] >= 5]
    ...:
38.5 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [54]: %%timeit
    ...: m = df['B'] < 5
    ...: df1 = df[m]
    ...: df2 = df[~m]
    ...:
37.3 ms ± 298 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [55]: %%timeit
    ...: df1 = df[df['B'].values < 5]
    ...: df2 = df[df['B'].values >= 5]
    ...:
37.8 ms ± 374 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [56]: %%timeit
    ...: m = df['B'].values < 5
    ...: df1 = df[m]
    ...: df2 = df[~m]
    ...:
36.8 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The solution from the other answer:
In [70]: %%timeit
    ...: sampled_dfs = [x for _, x in df.groupby(df['B']<5)]
    ...: df1 = sampled_dfs[0]
    ...: df2 = sampled_dfs[1]
    ...:
76.9 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
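The `%%timeit` cells above only run inside IPython. As a rough sketch for reproducing the comparison in a plain script, the standard-library `timeit` module can be used instead. The function names below are illustrative, the row count is reduced to 100,000 as an assumption so the script finishes quickly, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(2019)
N = 100_000  # assumption: smaller than the 1M rows above, for a quick run
df = pd.DataFrame({'A': np.random.randint(10, size=N),
                   'B': np.random.randint(10, size=N)})


def two_comparisons():
    # evaluate the condition twice, once per half
    return df[df['B'] < 5], df[df['B'] >= 5]


def mask_and_invert():
    # evaluate the condition once, then invert the mask
    m = df['B'] < 5
    return df[m], df[~m]


def groupby_split():
    # group keys are sorted, so the False group comes before the True group
    parts = [x for _, x in df.groupby(df['B'] < 5)]
    return parts[1], parts[0]


timings = {}
for func in (two_comparisons, mask_and_invert, groupby_split):
    # best of 3 repeats, 5 calls each, reported per call in milliseconds
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    timings[func.__name__] = best * 1000
    print(f"{func.__name__}: {timings[func.__name__]:.2f} ms")
```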
Answer 1 (score: 2)
You can use groupby as shown below:
df = pd.DataFrame({'A': [1,2,3,4], 'B': [1,4,9,16]})
sampled_dfs = [x for _, x in df.groupby(df['B']<5)]
print(sampled_dfs[0])
print(sampled_dfs[1])
Output:
   A   B
2  3   9
3  4  16
   A  B
0  1  1
1  2  4
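Note that the list is ordered by group key, so index 0 holds the False group (B >= 5) and index 1 the True group (B < 5). If every row falls on one side of the condition, the list has only one element and indexing `[1]` fails. One possible variation, offered as a sketch rather than as part of the original answer, is to key the groups by their boolean value and fall back to an empty frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1, 4, 9, 16]})

# Map each boolean group key to its sub-frame
groups = {bool(key): part for key, part in df.groupby(df['B'] < 5)}

# Fall back to an empty frame with the same columns if a group is missing
empty = df.iloc[0:0]
df1 = groups.get(True, empty)    # rows where B < 5
df2 = groups.get(False, empty)   # rows where B >= 5

print(df1)
print(df2)
```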