Question

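I have two data frames, remove and dat (the actual data frame). remove specifies various combinations of the factor variables found in dat, along with the number of cases to sample for each combination (remove$cases).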

Reproducible example:

set.seed(83)
dat <- data.frame(RateeGender=sample(c("Male", "Female"), size = 1500, replace = TRUE), 
                  RateeAgeGroup=sample(c("18-39", "40-49", "50+"), size = 1500, replace = TRUE),
                  Relationship=sample(c("Direct", "Manager", "Work Peer", "Friend/Family"), size = 1500, replace = TRUE),
                  X=rnorm(n=1500, mean=0, sd=1),
                  y=rnorm(n=1500, mean=0, sd=1),
                  z=rnorm(n=1500, mean=0, sd=1))

What I want to do is read through remove row by row and use each row to subset dat. My current approach is as follows:

remove <- expand.grid(RateeGender = c("Male", "Female"), 
                      RateeAgeGroup = c("18-39","40-49", "50+"),
                      Relationship = c("Direct", "Manager", "Work Peer", "Friend/Family"))
remove$cases <- c(36,34,72,58,47,38,18,18,15,22,17,10,24,28,11,27,15,25,72,70,52,43,21,27)

# For each row of remove (combination of factor levels:)
for (i in 1:nrow(remove)) {
  selection <- character()
  # For each column of remove (particular selection):
  for (j in 1:(ncol(remove)-1)){
    add <- paste0("dat$", names(remove)[j], ' == "', remove[i,j], '" & ')
    selection <- paste0(selection, add)
  }
  selection <- sub(' & $', '', selection) # Remove trailing ampersand
  cat(selection, sep = "\n") # What does selection string look like?
  tmp <- sample(dat[selection, ], size = remove$cases[i], replace = TRUE)
}

The output from cat() looks correct as the loop runs, for example: dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct". If I paste that into dat[dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct", ], I get the correct subset.

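However, if I write the loop with dat[selection, ], each subset returns only NAs. I get the same result if I use subset(). Note that I only have replace = TRUE above because the example data are randomly generated. In the real application, every combination will always have more cases than are needed.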

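I know I can use paste() this way to dynamically build formulas for lm() and other functions, but I am obviously missing something when translating that approach to [,].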

Any suggestions would be much appreciated!

Answer 1

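You cannot subset with [ or subset using a character expression like the one you describe. If you want to go that way, you have to construct the entire expression and then use eval. That said, there is a better solution using merge. For example, let's get all the entries in dat that match the first two rows of remove: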

merge(dat, remove[1:2,])

If we want all the rows that do not match those two rows, then:

subset(merge(dat, remove[1:2,], all.x=TRUE), is.na(cases))

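This assumes you want to join on the columns that have the same names in both tables. If you have a lot of data, you should consider using data.table, since it is very fast for this kind of operation.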
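For completeness, the eval route mentioned at the start of this answer could look roughly like the sketch below. The data frame toy and its columns g, a and x are made up for illustration and are not part of the question's data. It also shows why the loop in the question returns NAs: a character string used as a row index is looked up as a row name, not evaluated as a condition.

```r
# Hypothetical toy data, much smaller than the question's dat
toy <- data.frame(g = c("Male", "Female", "Male", "Female"),
                  a = c("18-39", "18-39", "50+", "50+"),
                  x = 1:4,
                  stringsAsFactors = FALSE)

# Build the condition as text, the same way the question's loop does
selection <- 'toy$g == "Male" & toy$a == "18-39"'

# A character index is matched against row names; nothing matches, so we get an NA row
by_name <- toy[selection, ]

# parse() turns the text into an expression, eval() runs it to get a logical vector
keep <- eval(parse(text = selection))
by_eval <- toy[keep, ]
```

Building code as text at run time is easy to get wrong, which is one reason the merge approach is probably the more comfortable choice here.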

Answer 2

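I upvoted BrodieG's answer before I realized that it does not do what you want when the size of a category is smaller than the number of samples required. (In fact, his method does not really do any sampling at all, but I think it is an elegant solution to a different problem, so I am not going to reverse my vote. You could use a similar split strategy, as shown below, with that data.frame as the input.)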

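# Split dat by factor combination, then sample the requested number of rows from each piece
sub <- lapply(split(dat, with(dat, paste(RateeGender,  # split vector
                                         RateeAgeGroup,
                                         Relationship, sep = "_"))),
              function(d) {
                # Look up the number of cases requested for this combination
                n <- with(remove, cases[RateeGender == d$RateeGender[1] &
                                        RateeAgeGroup == d$RateeAgeGroup[1] &
                                        Relationship == d$Relationship[1]])
                cat(n, "")  # print the sample size for this group
                # Sample row indices; sample(d, n) would sample columns of d instead
                d[sample(nrow(d), n, replace = TRUE), ]
              })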
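As a variation on the same idea, the row-by-row loop from the question can also be written without building any strings, by combining one logical vector per key column. This is only a sketch under stated assumptions: toy and want are hypothetical stand-ins for dat and remove, sampling is done without replacement, and every combination is assumed to contain at least as many rows as requested (which the question says holds for the real data).

```r
set.seed(1)
# Hypothetical toy stand-ins for the question's dat and remove
toy <- data.frame(g = rep(c("Male", "Female"), each = 6),
                  a = rep(c("18-39", "50+"), times = 6),
                  x = seq_len(12),
                  stringsAsFactors = FALSE)
want <- data.frame(g = c("Male", "Female"),
                   a = c("18-39", "50+"),
                   cases = c(2, 3),
                   stringsAsFactors = FALSE)

# Every column of want except the count is treated as a matching key
keys <- setdiff(names(want), "cases")

picked <- lapply(seq_len(nrow(want)), function(i) {
  # One logical vector per key column, combined element-wise with &
  mask <- Reduce(`&`, lapply(keys, function(k) toy[[k]] == want[[k]][i]))
  rows <- which(mask)
  # Sample positions within rows, so a group with a single row is handled correctly
  toy[rows[sample.int(length(rows), want$cases[i])], ]
})
result <- do.call(rbind, picked)
```

Here result stacks the sampled rows for all combinations into one data frame; keeping picked as a list instead mirrors the structure of sub above.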

Using paste to create a logical expression for subsetting a data frame
