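I'm working with unbalanced panel data, and I'd like to draw a random sample from it that is not influenced by the differing number of observations per unit. For example, in the code below, IBM is twice as likely to be selected as GOOG and five times as likely as MSFT. Is there a way to sample this data as if each company/year had the same probability of being selected? Possibly using the sampling package?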
df <- data.frame(COMPANY=c(rep('IBM',50),rep('GOOG',25),rep('MSFT',10)), YEAR=c(1961:2010,1988:2012,1996:2005), PROFIT=rnorm(85))
df
df[sample(nrow(df), 20, replace=FALSE), ]
Answer 0 (score: 3)
Here is what you can do:
probs <- 1 / table(df$COMPANY)[df$COMPANY]
df[sample(nrow(df), 20, replace = FALSE, prob = probs), ]
Let's test it:
table(df[sample(nrow(df), 1e6, replace = TRUE, prob = probs), "COMPANY"])
# GOOG IBM MSFT
# 333499 333080 333421
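One way to read the weighted draw with replacement is as a two-stage procedure: pick a company uniformly at random, then pick one of its rows uniformly at random. The following is a minimal sketch of that reading, assuming the df from the question; two_stage_sample is a hypothetical helper written for illustration, not a function from any package.

```r
set.seed(1)
df <- data.frame(COMPANY = c(rep("IBM", 50), rep("GOOG", 25), rep("MSFT", 10)),
                 YEAR = c(1961:2010, 1988:2012, 1996:2005),
                 PROFIT = rnorm(85),
                 stringsAsFactors = FALSE)

# Hypothetical helper: draw n rows with replacement in two stages.
two_stage_sample <- function(data, group, n) {
  # Row indices of `data`, split into one integer vector per group.
  rows_by_group <- split(seq_len(nrow(data)), data[[group]])
  # Stage 1: pick a group uniformly at random for each draw.
  picked <- sample(names(rows_by_group), n, replace = TRUE)
  # Stage 2: pick one row index uniformly within each chosen group.
  idx <- vapply(picked, function(g) {
    rows <- rows_by_group[[g]]
    rows[sample.int(length(rows), 1)]
  }, integer(1), USE.NAMES = FALSE)
  data[idx, , drop = FALSE]
}

s <- two_stage_sample(df, "COMPANY", 3e4)
round(prop.table(table(s$COMPANY)), 2)  # each share should be close to 1/3
```

Note that the check above uses replace = TRUE. With replace = FALSE the balance across companies is only approximate, because each selected row is removed and the remaining weights are effectively renormalized before the next draw.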
Instead of giving every row a probability of 1/(50 + 25 + 10), we weight the rows so that every company has the same total probability of being selected:
tapply(probs, df$COMPANY, sum)
# GOOG IBM MSFT
# 1 1 1
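(probs sums to 3 rather than 1, but sample takes care of that.) To make the math clearer, let's take a simple example (this one does not sum to 1 either, but that is not a problem):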
vec <- c(1, 1, 2)
as.vector(1 / table(vec)[vec])
# [1] 0.5 0.5 1.0
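The same weighting can be wrapped in a small reusable function. This is only a sketch: group_weights is a hypothetical name, and rescaling the weights to sum to 1 is optional, since sample does not require it. Converting the grouping values to character makes the lookup into the table go by name rather than by position, which matters when the group labels are numbers other than 1, 2, 3, and so on.

```r
# Hypothetical helper: one weight per row, equal total weight per group.
group_weights <- function(g) {
  g <- as.character(g)           # look up counts by name, not by position
  counts <- table(g)
  w <- as.vector(1 / counts[g])  # 1 / (size of the row's group)
  w / sum(w)                     # rescale so all weights sum to 1
}

vec <- c(1, 1, 2)
w <- group_weights(vec)
w
# [1] 0.25 0.25 0.50

# Each group's weights add up to the same total (1/2 here, with two groups).
tapply(w, vec, sum)
```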
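Answer 1 (score: 1)
I'm only a new R user, but here is my solution:
Load the example data (based on the PSID). The data is an unbalanced panel: 98 individual observations between 1977 and 1983, in 15 groups, with a gender identifier (not used).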
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 5L,5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L,10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 11L,12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 13L,13L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 15L, 15L, 15L, 15L, 15L,15L, 15L), year = c(1978L, 1979L, 1980L, 1981L, 1982L, 1983L,1977L, 1978L, 1979L, 1980L, 1981L, 1982L, 1983L, 1977L, 1978L,1979L, 1980L, 1981L, 1982L, 1983L, 1979L, 1977L, 1978L, 1979L,1980L, 1981L, 1982L, 1983L, 1977L, 1978L, 1979L, 1980L, 1981L,1982L, 1983L, 1977L, 1978L, 1979L, 1980L, 1981L, 1982L, 1983L,1977L, 1978L, 1979L, 1980L, 1981L, 1982L, 1983L, 1977L, 1978L,1979L, 1980L, 1981L, 1982L, 1983L, 1977L, 1978L, 1979L, 1980L,1981L, 1982L, 1983L, 1977L, 1978L, 1979L, 1980L, 1981L, 1982L,1983L, 1977L, 1978L, 1979L, 1980L, 1981L, 1982L, 1983L, 1977L,1978L, 1979L, 1980L, 1981L, 1982L, 1983L, 1977L, 1978L, 1979L,1980L, 1981L, 1982L, 1983L, 1977L, 1978L, 1979L, 1980L, 1981L,1982L, 1983L), gender = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("id", "year","gender"), row.names = c(NA, 98L), class = "data.frame")
Create a data frame with 1 observation per group id (in this example there are 15 distinct groups):
library(dplyr)  # provides select(), group_by(), sample_n() and %>%
sample <- select(df, id) %>% group_by(id) %>% sample_n(1)
Draw a random sample of those groups (the code below draws 5 of the 15):
sample <- ungroup(sample) %>% sample_n(5)  # keep the drawn ids unchanged so the merge can match them in df
Merge the original data frame with the sample data frame (many-to-one):
df_new <- merge(x = df, y = sample, by = "id", all.y = TRUE)
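The steps above amount to sampling whole units rather than rows: every id is a single candidate no matter how many years it has, and all rows of the chosen ids are kept. A base-R sketch of the same idea follows; the panel data frame here is a small made-up example (not the PSID extract above), and the names panel, chosen and panel_new are illustrative only.

```r
set.seed(42)
# Small hypothetical unbalanced panel: 4 units with 6, 7, 1 and 3 years.
panel <- data.frame(id = rep(1:4, times = c(6, 7, 1, 3)),
                    year = c(1978:1983, 1977:1983, 1979, 1981:1983))

# Each id is one candidate, regardless of how many rows it has.
ids <- unique(panel$id)
chosen <- sample(ids, 2)  # draw 2 units without replacement

# Keep every observation that belongs to the chosen units.
panel_new <- panel[panel$id %in% chosen, ]
panel_new
```

This gives each unit the same chance of entering the sample, but the number of rows in the result varies with which units are drawn.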