Question

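I have a list of user IDs and want to split them randomly into two groups in an 80:20 ratio.

For example, given a list of 100 user IDs, I want to put 80 users into group 1 at random and the remaining 20 into group 2.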

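def getLevelForIncrementality(Object[] args) {
  try {
    if (args.length >= 1 && args[0] != "") {
      // Salt the ID, hash it, and map the hash onto a bucket in 0..99
      String seed = args[0] + "Testing";
      int rnd = Math.abs(seed.hashCode() % 100);
      // Buckets 80..99 go to group 2, buckets 0..79 to group 1
      return (rnd >= 80 ? 2 : 1);
    }
  } catch (Exception e) { }
  return 3;  // fallback when no usable ID was passed
}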

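I tried the Groovy code above, which gives me a ratio of 82:18.

Can someone give me some insights, suggestions, or techniques for solving this problem for millions of user IDs?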

Answer 1

You can use random.sample to randomly draw the required number of elements:

import random

a = list(range(1000))

b = random.sample(a, int(len(a) * 0.8))
len(b)

800

If your IDs are unique, you can convert both lists to sets and take the difference to get the remaining elements:

c = list(set(a) - set(b))
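
For completeness, here is a minimal sketch that wraps the two steps above in one function. The name split_ids and the seed parameter are my own additions for illustration, not part of any library, and the sketch assumes the IDs are unique and hashable.

```python
import random


def split_ids(ids, fraction=0.8, seed=None):
    """Split unique IDs into two random groups; group 1 gets `fraction` of them."""
    rng = random.Random(seed)      # private generator, so a seed makes the split repeatable
    k = int(len(ids) * fraction)   # size of group 1, rounded down
    group1 = rng.sample(ids, k)    # k elements drawn without replacement
    chosen = set(group1)           # set lookup keeps the next step linear in len(ids)
    group2 = [i for i in ids if i not in chosen]  # everything not drawn, original order kept
    return group1, group2


users = list(range(1000))          # made-up IDs for the demo
g1, g2 = split_ids(users, 0.8, seed=42)
print(len(g1), len(g2))            # 800 200
```

Unlike the plain set difference, the list comprehension keeps the remaining IDs in their original order.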

Answer 2

This also works for splitting a list:

A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]  # sample list
cut = int(len(A) * 0.8)  # index where the first 80% ends
B = A[:cut]  # first 80% of the list
C = A[cut:]  # remaining part of the list
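
Note that slicing alone keeps the original order, so the same leading elements always end up in the 80% group. A minimal sketch of one way to make the split random, assuming it is acceptable to shuffle a copy of the list first (the variable names are only illustrative):

```python
import random

A = list(range(1, 17))           # sample list of 16 IDs
shuffled = A[:]                  # copy, so the original list stays untouched
random.shuffle(shuffled)         # in-place shuffle of the copy
cut = int(len(shuffled) * 0.8)   # index where the first 80% ends (12 for 16 items)
B = shuffled[:cut]               # random 80% of the list
C = shuffled[cut:]               # remaining 20%
print(len(B), len(C))            # 12 4
```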

Answer 3

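To distribute the data "on the fly" without building a large list, you can use a small control list that tells you how to spread the users across the two groups (in chunks of 5).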

import random

spread = []
while getNextUser():  # getNextUser() stands for your own source of users
    if not spread:
        spread = [1, 1, 1, 1, 0]  # number of 1s and 0s is 4 vs 1 (80%)
        random.shuffle(spread)
    if spread.pop():
        pass  # place on 80% side
    else:
        pass  # place on 20% side

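This guarantees a perfect 80:20 split for every 5 users, with a maximum imbalance of 4. The imbalance becomes smaller and smaller as more users are processed.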

Worst-case scenarios:

After 99 users: 19.2% instead of 20%, corrected to a perfect 20% at 100
After 999 users: 19.9%, corrected to 20% at 1000
After 9999 users: 19.99%, corrected to a perfect 20% at 10000
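
The worst-case figures above can be reproduced with a little arithmetic: every complete chunk of 5 contributes exactly one user to the 20% side, and in the worst case the unfinished chunk has not contributed any yet. A small sketch of that calculation (the function name worst_case_share and its parameters are just for illustration):

```python
def worst_case_share(n, chunk=5, minority=1):
    """Lowest possible share of the minority group after n users."""
    full, rest = divmod(n, chunk)
    # In the unfinished chunk all minority slots may still be pending,
    # unless so many users were drawn that some of them must be minority ones.
    forced = max(0, rest - (chunk - minority))
    return (full * minority + forced) / n


for n in (99, 100, 999, 1000, 9999, 10000):
    print(n, f"{worst_case_share(n):.2%}")
```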

Note: you can change the number of 1s and 0s in the spread list to get different ratios. For example, [1,1,0] gives you 2 to 1; [1,1,1,0] is 3 to 1 (75:25); [1] * 13 + [0] * 7 is 13 vs 7 (65:35).

You can generalize this into a generator that performs the right calculation and initialization for you:

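import random
from math import gcd

def spreadRatio(a, b):
    d = gcd(a, b)  # reduce the ratio, e.g. 80:20 becomes 4:1
    base = [True] * (a // d) + [False] * (b // d)
    spread = []
    while True:
        if not spread:
            # start a new chunk with a freshly shuffled copy of the pattern
            spread = base.copy()
            random.shuffle(spread)
        yield spread.pop()


pareto = spreadRatio(80, 20)
while getNextUser():  # getNextUser() stands for your own source of users
    if next(pareto):
        pass  # place on 80% side
    else:
        pass  # place on 20% side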
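
To try the generator without a real user source, here is a self-contained sketch that repeats the generator and feeds it 1000 made-up user IDs. The names spread_ratio, users, group1, and group2, as well as the fixed seed, are assumptions for the demo only.

```python
import random
from math import gcd


def spread_ratio(a, b):
    """Yield True/False forever; each chunk of (a + b) // gcd(a, b) draws is exactly a:b."""
    d = gcd(a, b)
    base = [True] * (a // d) + [False] * (b // d)
    spread = []
    while True:
        if not spread:
            spread = base.copy()
            random.shuffle(spread)
        yield spread.pop()


random.seed(0)                 # fixed seed only to make the demo repeatable
users = range(1000)            # stand-in for a real stream of user IDs
pareto = spread_ratio(80, 20)
group1, group2 = [], []
for user in users:
    # True goes to the 80% side, False to the 20% side
    (group1 if next(pareto) else group2).append(user)
print(len(group1), len(group2))  # 800 200
```

Because 1000 is a multiple of the chunk size 5, the counts come out at exactly 800 and 200 regardless of the seed.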

Answer 4

This can also be done with sklearn's train_test_split:

import numpy as np
from sklearn.model_selection import train_test_split

X = list(np.arange(1000))

x_80_percent, x_20_percent = train_test_split(X, test_size=0.20, shuffle=True)
