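Piggybacking off my earlier question, python pandas: assign control vs. treatment groupings randomly based on %
Thanks to @MaxU, I know how to randomly assign control/treatment groupings when there are 2 groups; but what if I have 3 or more groups?
For example: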
df.head()
customer_id | Group | many other columns
ABC 1
CDE 3
BHF 2
NID 1
WKL 3
SDI 2
JSK 1
OSM 3
MPA 2
MAD 1
pd.pivot_table(df,index=['Group'],values=["customer_id"],aggfunc=lambda x: len(x.unique()))
Group 1 : 270
Group 2 : 180
Group 3 : 330
I got a great answer for the case where I only have two groups:
df['Flag'] = df.groupby('Group')['customer_id']\
.transform(lambda x: np.random.choice(['Control','Test'], len(x),
p=[.5,.5] if x.name==1 else [.4,.6]))
But what if I want each group split with its own proportions?
@MaxU's answer is great, but unfortunately the split is not exact:
d = {1:[.5,.5], 2:[.4,.6], 3:[.2,.8]}
df['Flag'] = df.groupby('Group')['customer_id'] \
.transform(lambda x: np.random.choice(['Control','Test'], len(x), p=d[x.name]))
When I test it, I don't get the exact split:
pd.pivot_table(df,index=['Group'],values=["customer_id"],columns=['Flag'], aggfunc=lambda x: len(x.unique()))
Control Treatment
Group 1: 138 132
Group 2: 78 102
Group 3: 79 251
Group 1 should be 135/135.
Answer 0: (score: 2)
In [13]: df
Out[13]:
customer_id Group
0 ABC 1
1 CDE 3
2 BHF 2
3 NID 1
4 WKL 3
5 SDI 2
6 JSK 1
7 OSM 3
8 MPA 2
9 MAD 1
In [14]: d = {1:[.5,.5], 2:[.4,.6], 3:[.2,.8]}
In [15]: df['Flag'] = \
...: df.groupby('Group')['customer_id'] \
...: .transform(lambda x: np.random.choice(['Control','Test'], len(x), p=d[x.name]))
...:
In [16]: df
Out[16]:
customer_id Group Flag
0 ABC 1 Control
1 CDE 3 Test
2 BHF 2 Test
3 NID 1 Control
4 WKL 3 Control
5 SDI 2 Test
6 JSK 1 Test
7 OSM 3 Test
8 MPA 2 Control
9 MAD 1 Test
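A note on why this approach only approximates the requested split: np.random.choice with p= draws every label independently, so the per-group counts match the proportions in expectation only. The sketch below is a rough illustration under assumptions (the seeds and the variable names are hypothetical; the group size of 270 is taken from the question) that repeats the draw a few times and collects the number of Control rows.

```python
import numpy as np

n = 270                      # size of Group 1 in the question
controls = []
for seed in range(20):       # hypothetical seeds, used only for reproducibility
    rng = np.random.default_rng(seed)
    # Each row is labelled independently with probability .5 / .5
    flags = rng.choice(['Control', 'Test'], size=n, p=[.5, .5])
    controls.append(int((flags == 'Control').sum()))

print(controls)                  # counts scatter around 135 rather than hitting it
print(np.sqrt(n * .5 * .5))      # binomial standard deviation, roughly 8.2
```

A result such as 138/132 is therefore well within the normal spread for independent draws; getting exactly 135/135 needs a method that fixes the counts first and randomizes the order afterwards.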
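Answer 1: (score: 1)
It sounds like you are looking for a way to split customer_id into exact proportions rather than relying on chance. Here is one way to do that using pandas.qcut and np.random.permutation.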
In [228]: df = pd.DataFrame({'customer_id': np.random.normal(size=10000),
'group': np.random.choice(['a', 'b', 'c'], size=10000)})
In [229]: proportions = {'a':[.5,.5], 'b':[.4,.6], 'c':[.2,.8]}
In [230]: df.head()
Out[230]:
customer_id group
0 0.6547 c
1 1.4190 a
2 0.4205 a
3 2.3266 a
4 -0.5691 b
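In [231]: def assigner(gp):
     ...:     # Key of the group being processed (set by groupby.apply)
     ...:     group = gp.name
     ...:     # Bin the row positions at the cumulative proportions: 0 = control, 1 = treatment
     ...:     cut = np.asarray(pd.qcut(
     ...:         np.arange(gp.shape[0]),
     ...:         q=np.cumsum([0] + proportions[group]),
     ...:         labels=range(len(proportions[group]))
     ...:     ))  # np.asarray replaces .get_values(), which was removed in pandas 1.0
     ...:     # Shuffle so the assignment is random while the counts stay fixed
     ...:     return pd.Series(cut[np.random.permutation(gp.shape[0])], index=gp.index, name='assignment')
     ...: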
In [232]: df['assignment'] = df.groupby('group', group_keys=False).apply(assigner)
In [233]: df.head()
Out[233]:
customer_id group assignment
0 0.6547 c 1
1 1.4190 a 1
2 0.4205 a 0
3 2.3266 a 1
4 -0.5691 b 0
In [234]: (df.groupby(['group', 'assignment'])
.size()
.unstack()
.assign(proportion=lambda x: x[0] / (x[0] + x[1])))
Out[234]:
assignment 0 1 proportion
group
a 1659 1658 0.5002
b 1335 2003 0.3999
c 669 2676 0.2000
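To see what the pd.qcut call contributes on its own, here is a minimal sketch on a tiny input. The group size of 10 and the names n, props, edges, bins are assumptions made for illustration; the 20/80 proportions are the ones used for group 'c'.

```python
import numpy as np
import pandas as pd

n = 10                              # hypothetical group size
props = [.2, .8]                    # control / treatment shares
edges = np.cumsum([0] + props)      # quantile edges: 0, 0.2, 1.0

# Row positions 0..n-1 are binned at those quantiles, so the first 20% get label 0
bins = np.asarray(pd.qcut(np.arange(n), q=edges, labels=[0, 1]))
counts = np.bincount(bins)          # number of rows per label

# Shuffling reorders the labels but cannot change how many of each there are
shuffled = bins[np.random.permutation(n)]

print(bins)
print(counts, np.bincount(shuffled))
```

Before shuffling, the labels come out sorted (two 0s followed by eight 1s here), which is why the permutation step is needed to make the assignment random.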
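What's going on here?
Within each group, assigner is called.
assigner takes the group name and its proportions from the predefined dictionary, then calls pd.qcut to split the rows into 0 (control) and 1 (treatment).
np.random.permutation then shuffles the assignments.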
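The same idea (fix the counts, then randomize the order) can also be sketched without pd.qcut, directly on the Control/Test labels from the question. This swaps in a different technique, np.repeat followed by a shuffle, and is not the method of either answer. The helper exact_flags, the generated customer IDs, and the seed are hypothetical; the group sizes and the dictionary d come from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)      # seed chosen only for reproducibility

# Hypothetical data with the group sizes reported in the question
df = pd.DataFrame({
    'customer_id': [f'C{i:04d}' for i in range(780)],
    'Group': np.repeat([1, 2, 3], [270, 180, 330]),
})
d = {1: [.5, .5], 2: [.4, .6], 3: [.2, .8]}
labels = ['Control', 'Test']

def exact_flags(n, probs):
    """Return n shuffled labels whose counts follow probs as closely as rounding allows."""
    edges = np.rint(np.cumsum(probs) * n).astype(int)   # cumulative row counts
    edges[-1] = n                                       # guard against float drift
    counts = np.diff(edges, prepend=0)                  # rows per label
    flags = np.repeat(labels, counts)                   # e.g. 135 x Control, 135 x Test
    rng.shuffle(flags)                                  # randomize who gets which label
    return flags

df['Flag'] = (df.groupby('Group')['customer_id']
                .transform(lambda x: exact_flags(len(x), d[x.name])))

counts = pd.crosstab(df['Group'], df['Flag'])
print(counts)
```

With these sizes the table comes out as 135/135, 72/108, and 66/264. When a group size does not divide evenly, the counts are rounded to the nearest row, so the split is exact only up to that rounding.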