Question

Let x be a dataset with 5 variables and 15 observations:

age gender  height  weight  fitness
17  M   5.34    68  medium
23  F   5.58    55  medium
25  M   5.96    64  high
25  M   5.25    60  medium
18  M   5.57    60  low
17  F   5.74    61  low
17  M   5.96    71  medium
22  F   5.56    75  high
16  F   5.02    56  medium
21  F   5.18    63  low
20  M   5.24    57  medium
15  F   5.47    72  medium
16  M   5.47    61  high
22  F   5.88    73  low
18  F   5.73    62  medium

The frequencies of the values of the fitness variable are: low = 4, medium = 8, high = 3.

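Suppose I have another dataset y with the same 5 variables but 100 observations. The frequencies of the values of the fitness variable in this dataset are: low = 42, medium = 45, high = 13.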

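Using R, how can I draw a representative sample from y so that the distribution of fitness in the sample closely matches the distribution of fitness in x?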

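My initial idea was to use the sample function in R and pass weighted probabilities to the prob argument. However, using probabilities would force the frequency distribution to match exactly. My goal is to get a close enough match while maximizing the sample size.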

Also, suppose I want to add another constraint: the distribution of gender must also closely match that of x. How would I do that?

Answer 1

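The smallest frequency in y is 13, for the "high" fitness level, so you cannot sample more than 13 "high" records without replacement. That is your first constraint. You want to maximize the sample size, so take all 13. To match the proportions in x, those 13 should make up 20% (3/15) of the sample, which means the total must be 65 (13 / 0.2). The other frequencies must therefore be 17 (low, 65 × 4/15 ≈ 17.3) and 35 (medium, 65 × 8/15 ≈ 34.7). Since y contains enough records at both of these levels (42 and 45), you can use these numbers as they are. If any of the other sampling frequencies had exceeded the number available in y, that would have been a further constraint and you would have had to adjust accordingly.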
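The arithmetic above can be reproduced with a few lines of base R. This is only a sketch: the names `x_counts`, `y_counts`, `n_max` and `alloc` are introduced here for illustration, and the counts are the ones given in the question.

```r
# Frequencies of fitness in x (target) and in y (available), from the question
x_counts <- c(low = 4, medium = 8, high = 3)
y_counts <- c(low = 42, medium = 45, high = 13)

# Largest total sample size for which no level needs more records than y holds:
# level k needs n * x_k / sum(x) records, so n <= y_k * sum(x) / x_k for every k
n_max <- floor(min(y_counts * sum(x_counts) / x_counts))

# Per-level sample sizes, rounded to whole records and capped at availability
alloc <- pmin(round(n_max * x_counts / sum(x_counts)), y_counts)

n_max
alloc
```

Because of rounding, the allocations need not always sum exactly to `n_max`; here they do (17 + 35 + 13 = 65).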

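To draw the sample, first select all records with "high" fitness (certainty sampling). Next, sample from each of the other levels separately (stratified random sampling). Finally, combine all three.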

Example:

rm(list=ls())
# set-up the data (your "y"):
df <- data.frame(age=round(rnorm(100, 20, 5)), 
                 gender=factor(gl(2,50), labels=LETTERS[c(6, 13)]), 
                 height=round(rnorm(100, 12, 3)), 
                 fitness=factor(c(rep("low", 42), rep("medium", 45), rep("high", 13)), 
                                levels=c("low","medium","high")))

Create the subsets to sample from:

fit.low <- subset(df, subset=fitness=="low")
fit.medium <- subset(df, subset=fitness=="medium")
fit.high <- subset(df, subset=fitness=="high")

Sample 17 records from the low fitness group (40.5% of that group, and 26.2% of the 65-record sample, against a target of 26.7%).

fit.low_sam <- fit.low[sample(1:42, 17),]

Sample 35 records from the medium fitness group (77.8% of that group, and 53.8% of the 65-record sample).

fit.med_sam <- fit.medium[sample(1:45, 35),]

Combine them all.

fit.sam <- rbind(fit.low_sam, fit.med_sam, fit.high)
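The same subset, sample and combine steps can be wrapped in a small base R helper, which avoids hard-coding the group sizes. This is a sketch only: `stratified_sample` is a hypothetical helper name, and the data frame below is a simplified stand-in for y with an added `id` column so that the result is easy to check.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Simplified stand-in for y: 42 low, 45 medium, 13 high
y <- data.frame(
  id = 1:100,
  fitness = factor(rep(c("low", "medium", "high"), c(42, 45, 13)),
                   levels = c("low", "medium", "high"))
)
sizes <- c(low = 17, medium = 35, high = 13)

# Hypothetical helper: draw sizes[[level]] rows without replacement per level
stratified_sample <- function(data, by, sizes) {
  groups <- split(data, data[[by]])
  picked <- lapply(names(sizes), function(level) {
    g <- groups[[level]]
    g[sample.int(nrow(g), sizes[[level]]), , drop = FALSE]
  })
  do.call(rbind, picked)
}

s <- stratified_sample(y, "fitness", sizes)

# Compare the sampled distribution with the target from x (4/15, 8/15, 3/15)
table(s$fitness)
round(prop.table(table(s$fitness)), 3)
```

Asking for all 13 "high" rows simply returns them in random order, so the certainty-sampling step needs no special case.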

I tried to do this with dplyr's sample_n and sample_frac functions, but I don't think they allow stratified sampling with a different proportion in each stratum.

library(dplyr)
df %>%
  group_by(fitness) %>%
  sample_n(size=c(17,35,13), weight=c(0.27, 0.53, 0.2))
# Error

The sampling package, however, can certainly do this; see the related question "Stratified random sampling from data frame".

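library(sampling)
# size must follow the order in which the strata appear in df (low, medium, high)
s <- strata(df, stratanames = "fitness", size = c(17, 35, 13), method = "srswor")
getdata(df, s)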
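For readers who would rather stay within dplyr, one possible workaround is to look up the size for each group inside `group_modify()`. This is a sketch, not part of the original answer: it assumes dplyr 1.0.0 or later (where `slice_sample()` supersedes `sample_n()`), and the `sizes` vector and the `id` column are introduced here for illustration.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only for reproducibility

# Simplified stand-in for the data: 42 low, 45 medium, 13 high
df <- data.frame(
  id = 1:100,
  fitness = factor(rep(c("low", "medium", "high"), c(42, 45, 13)),
                   levels = c("low", "medium", "high"))
)
sizes <- c(low = 17, medium = 35, high = 13)  # per-stratum sample sizes

fit.sam <- df %>%
  group_by(fitness) %>%
  # .x holds the rows of one group, .y is the one-row key for that group
  group_modify(~ slice_sample(.x, n = sizes[[as.character(.y$fitness)]])) %>%
  ungroup()

count(fit.sam, fitness)
```

Looking sizes up by level name rather than by position means the result does not depend on the order of the groups.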

Answer 2

Consider using rmultinom to generate the sample counts for each fitness level.

Prepare the data (I have used the construction of y from @Edward's answer):

x <- read.table(text = "age gender  height  weight  fitness
17  M   5.34    68  medium
23  F   5.58    55  medium
25  M   5.96    64  high
25  M   5.25    60  medium
18  M   5.57    60  low
17  F   5.74    61  low
17  M   5.96    71  medium
22  F   5.56    75  high
16  F   5.02    56  medium
21  F   5.18    63  low
20  M   5.24    57  medium
15  F   5.47    72  medium
16  M   5.47    61  high
22  F   5.88    73  low
18  F   5.73    62  medium", header = TRUE)

y <- data.frame(age=round(rnorm(100, 20, 5)), 
                 gender=factor(gl(2,50), labels=LETTERS[c(6, 13)]), 
                 height=round(rnorm(100, 12, 3)), 
                 fitness=factor(c(rep("low", 42), rep("medium", 45), rep("high", 13)), 
                                levels=c("low","medium","high")))

Now the sampling procedure. UPD: I have changed the code to handle the two-variable case (gender and fitness).

library(tidyverse)

N_SAMPLES = 100

# Calculate frequencies
freq <- x %>%
    group_by(fitness, gender) %>% # You can set any combination of factors
    summarise(freq = n() / nrow(x)) 

# Prepare multinomial distribution
distr <- rmultinom(N_SAMPLES, 1, freq$freq)
# Convert to counts
freq$counts <- rowSums(distr)

# Join y with frequency for further use in sampling
y_count <- y %>% left_join(freq)

# Perform sampling using multinomial distribution counts
y_sampled <- y_count %>%
    group_by(fitness, gender) %>% # Should be the same as in frequencies calculation
    # Cap the requested count at the number of observations in the group
    sample_n(size = ifelse(n() > first(counts), first(counts), n()),
        replace = FALSE) %>%
    select(-freq, -counts)
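To make the role of rmultinom clearer, the counts can also be drawn in a single call and then capped at what y can supply. This is a minimal sketch for the one-variable case: `n_target`, `available` and `capped` are names introduced here, and 65 is used only because it is the largest size that y can support exactly, as worked out in the first answer.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

p <- c(low = 4, medium = 8, high = 3) / 15         # target proportions from x
available <- c(low = 42, medium = 45, high = 13)   # records per level in y

n_target <- 65
# One multinomial draw of n_target records; this has the same distribution as
# rowSums(rmultinom(n_target, 1, p)), i.e. summing n_target single-trial draws
counts <- rmultinom(1, size = n_target, prob = p)[, 1]
names(counts) <- names(p)

# A level cannot supply more records than it contains
capped <- pmin(counts, available)

counts
capped
```

The counts are random, so the realized proportions only approximate the target, and whenever a count is capped (for example when more than 13 "high" records are requested) the sample shifts further away from the distribution in x.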

Selecting a sample to match the distribution of a variable in another dataset
