Sampling within groups in pandas

Time: 2018-06-19 13:36:10

Tags: python python-3.x pandas

I have a DataFrame with user and session columns, and I want to randomly sample sessions so that the DataFrame contains N unique sessions per user. The ordering within each session matters, i.e. the 'in' column of every kept session must be preserved.

For example, if N = 2 and I have the DataFrame below, I want to end up with just 2 randomly selected sessions for each user, keeping every row of each selected session in its original order:

        x      in            session_id    user_id
0     0.0     1.0     trn-04a23351-283d       paul
1    -1.0     2.0     trn-04a23351-283d       paul
2    -1.0     3.0     trn-04a23351-283d       paul
3    -1.0     4.0     trn-04a23351-283d       paul
4    -1.0     1.0      blz-412313we-333       paul
5    -1.0     2.0      blz-412313we-333       paul
6     0.0     3.0      blz-412313we-333       paul
7    -1.0     1.0        wha-111111-fff       paul
8     0.0     2.0        wha-111111-fff       paul
9     1.0     1.0         bz-0000-01101      chris
10    0.0     2.0         bz-0000-01101      chris
11   -1.0     1.0       1111-sawas-1221      chris
12   -1.0     2.0       1111-sawas-1221      chris
13    1.0     1.0      pppppppppppppppp      chris
14    1.0     2.0      pppppppppppppppp      chris
15    1.0     3.0      pppppppppppppppp      chris
16   -1.0     1.0     55555555555555555     philip
17   -1.0     2.0     55555555555555555     philip
18   -1.0     3.0     55555555555555555     philip
19   -1.0     1.0       333333333333333     philip
20   -1.0     2.0       333333333333333     philip
21   -1.0     3.0       333333333333333     philip
22    0.0     1.0          zz-222222222     philip
23   -1.0     2.0          zz-222222222     philip
24    0.0     1.0       f-32355261-ss3d      sarah
25   -1.0     2.0       f-32355261-ss3d      sarah
26    0.0     3.0       f-32355261-ss3d      sarah
27    0.0     1.0               adasdfs      sarah
28   -1.0     2.0               adasdfs      sarah
29    0.0     3.0               adasdfs      sarah
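
For reference, the frame above can be rebuilt so that the answers below are directly runnable. A sketch (the values are copied from the table; reading them back with read_csv/StringIO is just one convenient way to construct the frame):

import io
import pandas as pd

# Rebuild the example DataFrame (columns: x, in, session_id, user_id)
df = pd.read_csv(io.StringIO("""\
x in session_id user_id
0.0 1.0 trn-04a23351-283d paul
-1.0 2.0 trn-04a23351-283d paul
-1.0 3.0 trn-04a23351-283d paul
-1.0 4.0 trn-04a23351-283d paul
-1.0 1.0 blz-412313we-333 paul
-1.0 2.0 blz-412313we-333 paul
0.0 3.0 blz-412313we-333 paul
-1.0 1.0 wha-111111-fff paul
0.0 2.0 wha-111111-fff paul
1.0 1.0 bz-0000-01101 chris
0.0 2.0 bz-0000-01101 chris
-1.0 1.0 1111-sawas-1221 chris
-1.0 2.0 1111-sawas-1221 chris
1.0 1.0 pppppppppppppppp chris
1.0 2.0 pppppppppppppppp chris
1.0 3.0 pppppppppppppppp chris
-1.0 1.0 55555555555555555 philip
-1.0 2.0 55555555555555555 philip
-1.0 3.0 55555555555555555 philip
-1.0 1.0 333333333333333 philip
-1.0 2.0 333333333333333 philip
-1.0 3.0 333333333333333 philip
0.0 1.0 zz-222222222 philip
-1.0 2.0 zz-222222222 philip
0.0 1.0 f-32355261-ss3d sarah
-1.0 2.0 f-32355261-ss3d sarah
0.0 3.0 f-32355261-ss3d sarah
0.0 1.0 adasdfs sarah
-1.0 2.0 adasdfs sarah
0.0 3.0 adasdfs sarah
"""), sep=" ")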

2 Answers:

Answer 0 (score: 3)

Create a reference DataFrame of unique (session_id, user_id) pairs, sample two sessions per user from it, then merge it back with the original:

# One row per unique (session_id, user_id) pair
d = df[['session_id', 'user_id']].drop_duplicates()
# Randomly pick n=2 sessions for each user
d = d.groupby('user_id', as_index=False).apply(pd.DataFrame.sample, n=2)

# Keep only the rows whose session was sampled
df.merge(d)

      x   in        session_id user_id
0  -1.0  1.0  blz-412313we-333    paul
1  -1.0  2.0  blz-412313we-333    paul
2   0.0  3.0  blz-412313we-333    paul
3  -1.0  1.0    wha-111111-fff    paul
4   0.0  2.0    wha-111111-fff    paul
5   1.0  1.0     bz-0000-01101   chris
6   0.0  2.0     bz-0000-01101   chris
7  -1.0  1.0   1111-sawas-1221   chris
8  -1.0  2.0   1111-sawas-1221   chris
9  -1.0  1.0   333333333333333  philip
10 -1.0  2.0   333333333333333  philip
11 -1.0  3.0   333333333333333  philip
12  0.0  1.0      zz-222222222  philip
13 -1.0  2.0      zz-222222222  philip
14  0.0  1.0   f-32355261-ss3d   sarah
15 -1.0  2.0   f-32355261-ss3d   sarah
16  0.0  3.0   f-32355261-ss3d   sarah
17  0.0  1.0           adasdfs   sarah
18 -1.0  2.0           adasdfs   sarah
19  0.0  3.0           adasdfs   sarah
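
If you are on pandas 1.1 or newer, the reference-frame idea above can also be written with GroupBy.sample. A minimal sketch, assuming the same column names; random_state=42 is an arbitrary seed added here for reproducibility and is not part of the original answer:

import pandas as pd

# Sample 2 unique sessions per user (GroupBy.sample, pandas >= 1.1),
# then merge back so every row of each sampled session is kept.
d = (df[['session_id', 'user_id']]
       .drop_duplicates()
       .groupby('user_id')
       .sample(n=2, random_state=42))  # arbitrary seed

result = df.merge(d, on=['session_id', 'user_id'])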

Answer 1 (score: 1)

Use groupby + transform to build a boolean mask over the original DataFrame, then subset the original df with that mask.

I use list(set(x)) (together with replace=False) to guarantee that the same session_id is never picked twice. This assumes you want every session_id to have the same probability of being chosen, regardless of how many rows it has in the original df.

import pandas as pd
import numpy as np

np.random.seed(123)
# For each user, randomly pick 2 distinct session_ids and flag every row belonging to them
mask = df.groupby('user_id').session_id.transform(
           lambda x: x.isin(np.random.choice(list(set(x)), 2, replace=False)))

df[mask]

Output:

      x   in         session_id user_id
0   0.0  1.0  trn-04a23351-283d    paul
1  -1.0  2.0  trn-04a23351-283d    paul
2  -1.0  3.0  trn-04a23351-283d    paul
3  -1.0  4.0  trn-04a23351-283d    paul
7  -1.0  1.0     wha-111111-fff    paul
8   0.0  2.0     wha-111111-fff    paul
11 -1.0  1.0    1111-sawas-1221   chris
12 -1.0  2.0    1111-sawas-1221   chris
13  1.0  1.0   pppppppppppppppp   chris
14  1.0  2.0   pppppppppppppppp   chris
15  1.0  3.0   pppppppppppppppp   chris
16 -1.0  1.0  55555555555555555  philip
17 -1.0  2.0  55555555555555555  philip
18 -1.0  3.0  55555555555555555  philip
22  0.0  1.0       zz-222222222  philip
23 -1.0  2.0       zz-222222222  philip
24  0.0  1.0    f-32355261-ss3d   sarah
25 -1.0  2.0    f-32355261-ss3d   sarah
26  0.0  3.0    f-32355261-ss3d   sarah
27  0.0  1.0            adasdfs   sarah
28 -1.0  2.0            adasdfs   sarah
29  0.0  3.0            adasdfs   sarah
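
To sanity-check either approach (a sketch, with N = 2 as in the question): every user in the sampled frame should end up with exactly N unique session_id values, and the 'in' order within each session is untouched because whole sessions are kept.

N = 2
sampled = df[mask]          # or the merged result from answer 0
assert (sampled.groupby('user_id')['session_id'].nunique() == N).all()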