Question

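I've looked at the Sklearn stratified sampling docs, the pandas docs, Stratified samples from Pandas, and sklearn stratified sampling based on a column, but none of them address this issue.

I'm looking for a fast pandas / sklearn / numpy way to generate stratified samples of size n from a dataset. However, for groups with fewer rows than the specified sample size, it should take all of their entries.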

Concrete example:

Thanks! :)

Answer 1

Use min when passing the number to sample. Consider the dataframe df:

import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
    B=range(10)
))

# Draw at most 2 rows per group; groups with fewer than 2 rows are kept whole.
df.groupby('A', group_keys=False).apply(lambda x: x.sample(min(len(x), 2)))

   A  B
1  1  1
2  1  2
3  2  3
6  2  6
7  3  7
9  4  9
8  4  8
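As a possible alternative to the apply-based call above, the same cap can be sketched by shuffling the whole frame once and then keeping the first n rows of each group with groupby(...).head(n). This is only an illustration under assumptions: the names n and capped, and the choice of random_state=0, are not from the answer.

```python
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
    B=range(10)
))

n = 2  # per-group cap (illustrative value)

# Shuffle all rows, then keep the first n rows of each group.
# head(n) returns every row of a group that has fewer than n rows,
# so small groups are taken in full.
capped = df.sample(frac=1, random_state=0).groupby('A').head(n)

print(capped.sort_index())
```

Because the shuffle happens before grouping, the rows kept within each group are a random subset, and no per-group lambda is needed.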

Answer 2

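Extending the groupby answer, we can make sure that the sample is balanced. When every class has at least n_samples rows, we can simply take n_samples from each class (the previous answer). When the minority class contains fewer than n_samples rows, we can take the same number of rows from every class as the minority class has.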

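def stratified_sample_df(df, col, n_samples):
    # Per-class sample size: n_samples, capped at the smallest class size.
    n = min(n_samples, df[col].value_counts().min())
    df_ = df.groupby(col).apply(lambda x: x.sample(n))
    # apply() adds the group key as an outer index level; drop it.
    df_.index = df_.index.droplevel(0)
    return df_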
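A minimal sketch of the same balancing idea, assuming pandas 1.1 or later, where DataFrameGroupBy.sample is available. The function name balanced_sample and the random_state parameter are hypothetical and only here for illustration; this is not the answer's own code.

```python
import pandas as pd

def balanced_sample(df, col, n_samples, random_state=None):
    """Draw the same number of rows from every class in `col` (illustrative helper)."""
    # Cap the per-class size at the size of the smallest class.
    n = int(min(n_samples, df[col].value_counts().min()))
    # groupby(...).sample keeps the original row index, so no droplevel is needed.
    return df.groupby(col).sample(n=n, random_state=random_state)

df = pd.DataFrame(dict(
    A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
    B=range(10)
))

# The smallest class (A == 3) has one row, so every class contributes one row.
balanced = balanced_sample(df, 'A', n_samples=2, random_state=0)
print(balanced)
```

Note the trade-off: balancing to the minority class can discard most of the data when one class is very small, which differs from the question's request to keep small groups whole and sample the others at n.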

Answer 3

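The following samples a total of N rows, where each group appears in its original proportion rounded to the nearest integer, then shuffles the result and resets the index. Using: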

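import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))

N = 10  # total sample size; value assumed here for illustration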

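Short and sweet. Note that weights='A' uses the values in column A as per-row sampling weights, so rows with a larger A are drawn more often and the original group proportions are not guaranteed: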

df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)

Long version:

df.groupby('A', group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(df))))).sample(frac=1).reset_index(drop=True)
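The long version can be unpacked into explicit steps, which makes the per-group allocation visible. This is a sketch under assumptions: proportional_sample, total, and random_state are hypothetical names introduced for illustration, and total=10 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

def proportional_sample(df, col, total, random_state=None):
    """Sample about `total` rows, keeping each group's share of `col` (illustrative helper)."""
    sizes = df[col].value_counts()
    # Rows to draw per group, rounded to the nearest integer.
    alloc = np.rint(total * sizes / len(df)).astype(int)
    parts = [
        group.sample(n=int(alloc.loc[key]), random_state=random_state)
        for key, group in df.groupby(col)
    ]
    # Shuffle the combined rows and reset the index.
    combined = pd.concat(parts)
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))

result = proportional_sample(df, 'A', total=10, random_state=1)
print(result['A'].value_counts().sort_index())
```

Because each group's count is rounded independently, the overall number of rows is not guaranteed to equal the requested total for every input. It happens to sum to 10 for this frame.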

Stratified sampling in pandas
