Question

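I have a list of IDs (device IDs) in a DataFrame. I want to randomly assign A or B to each of these devices (splitting them in half):

Suppose we have a DataFrame named devices with a "DeviceId" column and 9364957 rows.

Option 1: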

import numpy as np


def coin():
    p = 0.5
    r = np.random.random()
    return 'B' if r > p else 'A'


# one coin flip per row; a plain list avoids index-alignment surprises
devices['Experiment'] = [coin() for _ in range(len(devices))]

g = devices.groupby(['Experiment']).agg(['count'])
print(g.head(10))

Output:

Experiment
A 4681923
B 4683034

B has 1,111 more entries than A!

Option 2 (I'm stuck here :( ):

A = devices.sample(frac=0.5, replace=False)
print('\tselect as A: ', A.shape[0])

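	select as A:  4682478

By simple arithmetic this is a better split, because B would then get 4682479 rows (the difference between the two groups is exactly 1).

But how do I continue from here?

My goal is to get an updated devices DataFrame with two columns: DeviceId and Experiment ("A" or "B").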

Answer 1

There are many ways to create the new Experiment column. Here is one approach that builds on the sample() function you are already using:

import pandas as pd
import numpy as np

# some sample data
devices = pd.DataFrame({'DeviceId': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# initialize experiment column to 'A'
devices['Experiment'] = 'A'

# use the index of a sampling to change ~50% of the labels to 'B'
devices.loc[devices.sample(frac=0.5, replace=False).index, 'Experiment'] = 'B'

In my example, I got the following data:

>>> devices.groupby('Experiment').count()

Experiment  DeviceId
A           5
B           6
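If an exact split and a repeatable result both matter, one possible variation is to shuffle a pre-built array of labels instead of sampling rows. The sketch below is only an illustration: the helper name assign_balanced and its seed parameter are assumptions, not part of the original question or answer.

```python
import numpy as np
import pandas as pd


def assign_balanced(df, seed=None):
    """Return a copy of df with an 'Experiment' column split as evenly as possible."""
    rng = np.random.default_rng(seed)
    n = len(df)
    # n // 2 labels are 'A'; the remaining n - n // 2 labels are 'B'.
    labels = np.repeat(['A', 'B'], [n // 2, n - n // 2])
    out = df.copy()
    # Shuffle the labels so each row receives a random one.
    out['Experiment'] = rng.permutation(labels)
    return out


devices = pd.DataFrame({'DeviceId': range(11)})
result = assign_balanced(devices, seed=42)
counts = result['Experiment'].value_counts()
print(counts)
```

With 11 rows this gives 5 'A' and 6 'B'. On a frame with 9364957 rows the two groups would differ by exactly one row, and passing the same seed reproduces the same assignment.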

Alternatively, if you want to use your coin() function, you can try apply() (although this will probably be slower):

devices['Experiment'] = devices.apply(lambda x: coin(), axis=1)
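Calling a Python function once per row can be slow on millions of rows. One way to get the same kind of independent coin flip without a per-row call is to draw all the random numbers at once. This is a sketch assuming NumPy's Generator API; the seed is arbitrary and the small frame is made-up stand-in data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
devices = pd.DataFrame({'DeviceId': range(1000)})  # stand-in data

# One uniform draw per row; values above 0.5 become 'B', the rest 'A'.
draws = rng.random(len(devices))
devices['Experiment'] = np.where(draws > 0.5, 'B', 'A')

print(devices['Experiment'].value_counts())
```

Like coin(), this does not guarantee an exact 50/50 split; each row is still assigned independently.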

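Also, B having 1,111 more entries than A is not an error; it is a consequence of generating (pseudo)random numbers. Consider these numbers as fractions of the population:

A=0.49994068312326473 B=0.5000593168767353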


That is very close to the 50/50 split you wanted. The more samples you draw, the closer the proportions get to 50/50 (in theory).
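To put the observed gap in perspective, here is a rough check under the assumption that each device is an independent fair coin flip. In that case the count of B follows a Binomial(n, 0.5) distribution, and the gap B minus A equals 2*B - n, whose standard deviation is sqrt(n). The loop at the end is only an illustrative simulation with an arbitrary seed.

```python
import math
import numpy as np

n = 9364957                        # number of devices in the question
observed_gap = 4683034 - 4681923   # B minus A from the question's output

# Standard deviation of the gap under independent fair coin flips.
gap_sd = math.sqrt(n)
print(f"observed gap: {observed_gap}, standard deviation of gap: {gap_sd:.0f}")

# Simulate the share of 'B' for increasing sample sizes.
rng = np.random.default_rng(1)  # arbitrary seed
for size in (100, 10_000, 1_000_000):
    share_b = rng.binomial(size, 0.5) / size
    print(size, share_b)
```

The observed gap of 1,111 is well under one standard deviation (about 3,060), so it is an unremarkable outcome. Note that the absolute gap tends to grow with n while the proportions tend toward 50/50.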

Python: randomly split a DataFrame in half and assign values in a new column
