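I have a data frame with user and session columns, and I want to randomly sample sessions so that the data frame contains N unique sessions per user. The order within each session matters, i.e. the "in" column of every session must be preserved.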
For example, with N = 2 and the data frame below (up to three sessions per user), I want a result that keeps two randomly chosen sessions per user, with all rows of each kept session:
x in session_id user_id
0 0.0 1.0 trn-04a23351-283d paul
1 -1.0 2.0 trn-04a23351-283d paul
2 -1.0 3.0 trn-04a23351-283d paul
3 -1.0 4.0 trn-04a23351-283d paul
4 -1.0 1.0 blz-412313we-333 paul
5 -1.0 2.0 blz-412313we-333 paul
6 0.0 3.0 blz-412313we-333 paul
7 -1.0 1.0 wha-111111-fff paul
8 0.0 2.0 wha-111111-fff paul
9 1.0 1.0 bz-0000-01101 chris
10 0.0 2.0 bz-0000-01101 chris
11 -1.0 1.0 1111-sawas-1221 chris
12 -1.0 2.0 1111-sawas-1221 chris
13 1.0 1.0 pppppppppppppppp chris
14 1.0 2.0 pppppppppppppppp chris
15 1.0 3.0 pppppppppppppppp chris
16 -1.0 1.0 55555555555555555 philip
17 -1.0 2.0 55555555555555555 philip
18 -1.0 3.0 55555555555555555 philip
19 -1.0 1.0 333333333333333 philip
20 -1.0 2.0 333333333333333 philip
21 -1.0 3.0 333333333333333 philip
22 0.0 1.0 zz-222222222 philip
23 -1.0 2.0 zz-222222222 philip
24 0.0 1.0 f-32355261-ss3d sarah
25 -1.0 2.0 f-32355261-ss3d sarah
26 0.0 3.0 f-32355261-ss3d sarah
27 0.0 1.0 adasdfs sarah
28 -1.0 2.0 adasdfs sarah
29 0.0 3.0 adasdfs sarah
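The snippets in the answers refer to a data frame named df. As a rough sketch for anyone who wants to run them, the table above can be rebuilt like this. The sessions list is just my own compact transcription of the table, and it assumes that "in" is simply the 1-based position of a row within its session, which is what the table shows.

```python
import pandas as pd

# (user_id, session_id, x values in row order), transcribed from the table above
sessions = [
    ("paul", "trn-04a23351-283d", [0.0, -1.0, -1.0, -1.0]),
    ("paul", "blz-412313we-333", [-1.0, -1.0, 0.0]),
    ("paul", "wha-111111-fff", [-1.0, 0.0]),
    ("chris", "bz-0000-01101", [1.0, 0.0]),
    ("chris", "1111-sawas-1221", [-1.0, -1.0]),
    ("chris", "pppppppppppppppp", [1.0, 1.0, 1.0]),
    ("philip", "55555555555555555", [-1.0, -1.0, -1.0]),
    ("philip", "333333333333333", [-1.0, -1.0, -1.0]),
    ("philip", "zz-222222222", [0.0, -1.0]),
    ("sarah", "f-32355261-ss3d", [0.0, -1.0, 0.0]),
    ("sarah", "adasdfs", [0.0, -1.0, 0.0]),
]

# Assumption: "in" counts rows within a session, starting at 1.0
rows = [
    {"x": x, "in": float(i), "session_id": sid, "user_id": user}
    for user, sid, xs in sessions
    for i, x in enumerate(xs, start=1)
]
df = pd.DataFrame(rows, columns=["x", "in", "session_id", "user_id"])
```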
Answer 0 (score: 3)
Create a reference data frame to merge with:
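import pandas as pd
# One row per (session_id, user_id) pair, then draw 2 sessions per user
d = df[['session_id', 'user_id']].drop_duplicates()
d = d.groupby('user_id', as_index=False).apply(pd.DataFrame.sample, n=2)
# The inner merge on the shared columns keeps only rows of the sampled sessions
df.merge(d)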
x in session_id user_id
0 -1.0 1.0 blz-412313we-333 paul
1 -1.0 2.0 blz-412313we-333 paul
2 0.0 3.0 blz-412313we-333 paul
3 -1.0 1.0 wha-111111-fff paul
4 0.0 2.0 wha-111111-fff paul
5 1.0 1.0 bz-0000-01101 chris
6 0.0 2.0 bz-0000-01101 chris
7 -1.0 1.0 1111-sawas-1221 chris
8 -1.0 2.0 1111-sawas-1221 chris
9 -1.0 1.0 333333333333333 philip
10 -1.0 2.0 333333333333333 philip
11 -1.0 3.0 333333333333333 philip
12 0.0 1.0 zz-222222222 philip
13 -1.0 2.0 zz-222222222 philip
14 0.0 1.0 f-32355261-ss3d sarah
15 -1.0 2.0 f-32355261-ss3d sarah
16 0.0 3.0 f-32355261-ss3d sarah
17 0.0 1.0 adasdfs sarah
18 -1.0 2.0 adasdfs sarah
19 0.0 3.0 adasdfs sarah
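On pandas 1.1 or later, the same idea can probably be written with GroupBy.sample, which also accepts a random_state for repeatable draws. This is a variant sketch, not the code from the answer above; the function name sample_sessions and the toy data are made up for illustration.

```python
import pandas as pd

def sample_sessions(df, n, seed=None):
    """Keep n randomly chosen sessions per user, with all rows of each kept session."""
    # One row per (user, session) pair, so every session is equally likely.
    pairs = df[["user_id", "session_id"]].drop_duplicates()
    # GroupBy.sample draws n rows per group without replacement;
    # it is expected to fail if a user has fewer than n sessions.
    picked = pairs.groupby("user_id").sample(n=n, random_state=seed)
    # Inner merge keeps only the rows that belong to a picked session.
    return df.merge(picked, on=["user_id", "session_id"])

# Hypothetical toy data, much smaller than the table in the question.
df = pd.DataFrame({
    "in": [1, 2, 1, 2, 3, 1, 1, 2, 1, 1, 2],
    "session_id": ["a", "a", "b", "b", "b", "c", "d", "d", "e", "f", "f"],
    "user_id": ["paul"] * 6 + ["chris"] * 5,
})
result = sample_sessions(df, n=2, seed=0)
```

Passing the same seed should give the same sessions on every run, which is handy when the sample feeds into a train/test split.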
Answer 1 (score: 1)
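Use groupby + transform to define a mask over the original data frame, then subset the original df with that mask.
I use list(set(x)) so that the same session_id cannot be picked twice (together with replace=False). This assumes you want every session_id to have the same probability of being picked, regardless of how many times it appears in the original df.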
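import pandas as pd
import numpy as np
np.random.seed(123)
# Per user, draw 2 distinct session ids and flag the rows that belong to them
mask = df.groupby('user_id').session_id.transform(
    lambda x: x.isin(np.random.choice(list(set(x)), 2, replace=False)))
# Boolean indexing keeps the original row order and index
df[mask]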
Output:
x in session_id user_id
0 0.0 1.0 trn-04a23351-283d paul
1 -1.0 2.0 trn-04a23351-283d paul
2 -1.0 3.0 trn-04a23351-283d paul
3 -1.0 4.0 trn-04a23351-283d paul
7 -1.0 1.0 wha-111111-fff paul
8 0.0 2.0 wha-111111-fff paul
11 -1.0 1.0 1111-sawas-1221 chris
12 -1.0 2.0 1111-sawas-1221 chris
13 1.0 1.0 pppppppppppppppp chris
14 1.0 2.0 pppppppppppppppp chris
15 1.0 3.0 pppppppppppppppp chris
16 -1.0 1.0 55555555555555555 philip
17 -1.0 2.0 55555555555555555 philip
18 -1.0 3.0 55555555555555555 philip
22 0.0 1.0 zz-222222222 philip
23 -1.0 2.0 zz-222222222 philip
24 0.0 1.0 f-32355261-ss3d sarah
25 -1.0 2.0 f-32355261-ss3d sarah
26 0.0 3.0 f-32355261-ss3d sarah
27 0.0 1.0 adasdfs sarah
28 -1.0 2.0 adasdfs sarah
29 0.0 3.0 adasdfs sarah
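One caveat with list(set(x)): Python randomizes string hashes per process unless PYTHONHASHSEED is fixed, so the order of the set, and with it the sample, can differ between runs even with np.random.seed(123). A possible workaround, sketched below, is to draw from x.unique(), which keeps order of first appearance, and to use a seeded NumPy Generator. The names session_mask and pick and the toy data are my own, not part of the answer above.

```python
import numpy as np
import pandas as pd

def session_mask(df, n, seed=None):
    """Boolean mask that is True for rows of n randomly chosen sessions per user."""
    rng = np.random.default_rng(seed)

    def pick(sessions):
        # unique() preserves order of first appearance, unlike set()
        chosen = rng.choice(sessions.unique(), size=n, replace=False)
        return sessions.isin(chosen)

    mask = df.groupby("user_id")["session_id"].transform(pick)
    # Make sure the result is a plain boolean Series for indexing
    return mask.astype(bool)

# Hypothetical toy data, much smaller than the table in the question.
df = pd.DataFrame({
    "in": [1, 2, 1, 2, 3, 1, 1, 2, 1, 1, 2],
    "session_id": ["a", "a", "b", "b", "b", "c", "d", "d", "e", "f", "f"],
    "user_id": ["paul"] * 6 + ["chris"] * 5,
})
mask = session_mask(df, n=2, seed=123)
sampled = df[mask]
```

As in the original snippet, df[mask] keeps the original index, so the rows of each kept session stay in their original order.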