Python: get random data from a pandas DataFrame

Time: 2020-10-29 19:46:18

Tags: python python-3.x pandas dataframe random

I have a DataFrame df with these values:

name     algo      accuracy
tom       1         88
tommy     2         87
mark      1         88
stuart    3         100
alex      2         99
lincoln   1         88

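How can I randomly select 4 records from df, with the condition that at least one record is selected from each unique value of the algo column? Here the algo column has only 3 unique values (1, 2, 3).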

Sample output:

name     algo      accuracy
tom       1         88
tommy     2         87
stuart    3         100
lincoln   1         88

Sample output 2:

name     algo      accuracy
mark      1         88
stuart    3         100
alex      2         99
lincoln   1         88

1 answer:

Answer 0: (score: 3)

One way:

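import pandas as pd

num_sample, num_algo = 4, 3

# sample one row for each algo (groupby().sample needs pandas >= 1.1)
out = df.groupby('algo').sample(n=num_sample // num_algo)

# add the remaining samples from the rows that didn't get selected;
# pd.concat is used because DataFrame.append was removed in pandas 2.0
rest = df.drop(out.index).sample(n=num_sample - len(out))
out = pd.concat([out, rest])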
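A self-contained sketch of this first approach on the question's data may help when trying it out. The helper name `sample_with_coverage` and its `col` and `random_state` parameters are assumptions made for illustration, not part of the original answer. It always draws exactly one row per group first, then fills the remaining slots, and it combines the pieces with `pd.concat` because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# the frame from the question
df = pd.DataFrame({
    "name": ["tom", "tommy", "mark", "stuart", "alex", "lincoln"],
    "algo": [1, 2, 1, 3, 2, 1],
    "accuracy": [88, 87, 88, 100, 99, 88],
})


def sample_with_coverage(df, num_sample, col="algo", random_state=None):
    """Hypothetical helper: num_sample random rows, at least one per value of col."""
    num_groups = df[col].nunique()
    if not num_groups <= num_sample <= len(df):
        raise ValueError("num_sample must lie between the group count and len(df)")
    # one random row per group guarantees that every value of col is covered
    base = df.groupby(col).sample(n=1, random_state=random_state)
    # fill the remaining slots from rows that were not picked yet
    rest = df.drop(base.index).sample(
        n=num_sample - len(base), random_state=random_state
    )
    return pd.concat([base, rest])


out = sample_with_coverage(df, num_sample=4, random_state=0)
print(out)
```

Passing `random_state` only makes the draw repeatable; leave it as `None` for a fresh sample on every call.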

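Another way is to shuffle the whole data, enumerate the rows within each algo, sort by that enumeration, and take the required number of samples. This is slightly more code than the first approach, but it is cheaper and produces more balanced algo counts: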

# shuffle data
df_random = df['algo'].sample(frac=1)

# enumerations of rows with the same algo
enums = df_random.groupby(df_random).cumcount()

# sort by the enumeration, so the first row of every algo comes first:
enums = enums.sort_values()

# pick the first num_sample indices
# these will be indices of the samples
# so we can use `loc`
out = df.loc[enums.iloc[:num_sample].index]
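Wrapped up as a function, the shuffle-and-enumerate approach might look like the sketch below. The name `balanced_sample` and its `col` and `random_state` parameters are hypothetical, introduced here only for illustration. The sketch also asks for a stable sort (`kind="mergesort"`), an addition to the original answer, so that rows with equal rank keep their shuffled order.

```python
import pandas as pd

# the frame from the question
df = pd.DataFrame({
    "name": ["tom", "tommy", "mark", "stuart", "alex", "lincoln"],
    "algo": [1, 2, 1, 3, 2, 1],
    "accuracy": [88, 87, 88, 100, 99, 88],
})


def balanced_sample(df, num_sample, col="algo", random_state=None):
    """Hypothetical helper: round-robin style sample across the values of col."""
    # shuffle only the grouping column; its index still labels the original rows
    shuffled = df[col].sample(frac=1, random_state=random_state)
    # 0 for the first shuffled row of each group, 1 for the second, and so on
    rank = shuffled.groupby(shuffled).cumcount()
    # stable sort: all rank-0 rows first, ties stay in shuffled order
    rank = rank.sort_values(kind="mergesort")
    # the first num_sample labels cover every group before any group repeats
    return df.loc[rank.index[:num_sample]]


out = balanced_sample(df, num_sample=4, random_state=0)
print(out)
```

With the six-row frame and `num_sample=4`, the result holds one row for each algo plus one extra row from algo 1 or 2, since algo 3 has only a single row. Coverage of every algo is only guaranteed when `num_sample` is at least the number of distinct algo values.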