Two sets of variables

Time: 2017-10-30 11:59:09

Tags: r

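I have a data frame containing two sets of variables, baseline variables (var1-var5) and follow-up variables (tar1-tar5), plus an ID variable. The data frame also has some missing values.

I would like to get the count and proportion of all observations that have a value for one or more of the var1-var5 variables AND for one or more of the tar1-tar5 variables. In the MWE below, observations 02-06, but not observation 01, meet my criteria, so I expect a count of 5 and a proportion of 0.83.

The code I have for the count and proportion only counts observations that have no missing values on any variable, and I can't work out how to change it. Any help would be much appreciated.

Starting point: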

df <- data.frame(ID=c("01","02","03","04","05","06"),var1=c(NA,"a","a","a","a","a"),var2=c(NA,NA,"b","b","b","b"),var3=c(NA,NA,NA,"c","c","c"),var4=c(NA,NA,NA,NA,"d","d"),var5=c(NA,NA,NA,NA,NA,"e"),tar1=c(NA,"a","a","a","a","a"),tar2=c(NA,NA,"b","b","b","b"),tar3=c(NA,NA,NA,"c","c","c"),tar4=c(NA,NA,NA,NA,"d","d"),tar5=c(NA,NA,NA,NA,NA,"e"), stringsAsFactors = FALSE)

Current code:

sum(!is.na(df$var1) & !is.na(df$var2) & !is.na(df$var3) & !is.na(df$var4) & !is.na(df$var5) & !is.na(df$tar1) & !is.na(df$tar2) & !is.na(df$tar3) & !is.na(df$tar4) & !is.na(df$tar5))   

sum(!is.na(df$var1) & !is.na(df$var2) & !is.na(df$var3) & !is.na(df$var4) & !is.na(df$var5) & !is.na(df$tar1) & !is.na(df$tar2) & !is.na(df$tar3) & !is.na(df$tar4) & !is.na(df$tar5))/nrow(df)   

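2 answers:

Answer 0: (score: 1)

Based on the description, it seems we are looking for rows with at least one 'var' column that is non-NA and at least one 'tar' column that is non-NA. In that case, we can subset the dataset to the 'var' columns, convert the subset to a logical matrix (!is.na(df[grep(...)])), take the rowSums, and check whether each row has at least one non-NA element, i.e. rowSums(...)>0. We then repeat the same procedure on the 'tar' subset and combine the two results with & to check that both the 'var' and the 'tar' columns have non-NA elements.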

i1 <- rowSums(!is.na(df[grep("var", names(df))]))>0 &  
                   rowSums(!is.na(df[grep("tar", names(df))]))>0
i1
#[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

sum(i1)
#[1] 5
mean(i1)
#[1] 0.8333333
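
As a small variation on the above, the column patterns can be anchored so that only names of the form var1, var2, ... and tar1, tar2, ... are selected. This only matters if the data frame might contain other columns whose names happen to include "var" or "tar". The sketch below assumes the naming scheme from the question and uses a reduced copy of the data with a hypothetical extra column called target, which is not in the original data frame.

```r
# Reduced version of the question's data, plus a hypothetical "target" column
# (an assumption for illustration; it is not part of the original data).
df <- data.frame(
  ID     = c("01", "02", "03", "04", "05", "06"),
  var1   = c(NA, "a", "a", "a", "a", "a"),
  var2   = c(NA, NA, "b", "b", "b", "b"),
  tar1   = c(NA, "a", "a", "a", "a", "a"),
  tar2   = c(NA, NA, "b", "b", "b", "b"),
  target = c("x", "x", "x", "x", "x", "x"),
  stringsAsFactors = FALSE
)

# Anchored patterns: "var" or "tar" followed only by digits.
var_cols <- grep("^var[0-9]+$", names(df))
tar_cols <- grep("^tar[0-9]+$", names(df))

# TRUE where a row has at least one non-NA value in each group.
i1 <- rowSums(!is.na(df[var_cols])) > 0 &
  rowSums(!is.na(df[tar_cols])) > 0

sum(i1)   # count of qualifying rows
mean(i1)  # proportion of qualifying rows
```

With an unanchored grep("tar", names(df)), the target column would also be selected and observation 01 would wrongly qualify, because target is never NA.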

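Answer 1: (score: 0)

This is wordier (which I always appreciate when I come back to my code six months later), but more general: we create a function that returns a logical vector indicating which rows have at least one value that is not NA.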

df <- data.frame(ID=c("01","02","03","04","05","06"),var1=c(NA,"a","a","a","a","a"),var2=c(NA,NA,"b","b","b","b"),var3=c(NA,NA,NA,"c","c","c"),var4=c(NA,NA,NA,NA,"d","d"),var5=c(NA,NA,NA,NA,NA,"e"),tar1=c(NA,"a","a","a","a","a"),tar2=c(NA,NA,"b","b","b","b"),tar3=c(NA,NA,NA,"c","c","c"),tar4=c(NA,NA,NA,NA,"d","d"),tar5=c(NA,NA,NA,NA,NA,"e"), stringsAsFactors = FALSE)

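# Does the actual work: returns a logical vector, TRUE for rows with at least one non-NA value
whichRowsHaveValues <- function(dataFrame) {
  # named hasValue rather than which, to avoid shadowing base::which
  hasValue <- apply(X=dataFrame, FUN=function(x) any(!is.na(x)), MARGIN=1)
  return(hasValue)
}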

# Choose columns
varCols <- grepl(x=colnames(df), pattern="var", fixed=TRUE)
tarCols <- grepl(x=colnames(df), pattern="tar", fixed=TRUE)

# Find the answer
goodRows <- whichRowsHaveValues(df[varCols]) & 
    whichRowsHaveValues(df[tarCols])
count <- sum(goodRows)
proportion <- count/nrow(df)

Of course, you may not want it quite this verbose, and we don't really need all of those intermediate variables...
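
For comparison, a more compact form of the same idea might look like the sketch below. The helper name has_value and the use of startsWith (available in R 3.3.0 and later) are choices made here for illustration, not part of the answer above; the data frame is the one from the question.

```r
df <- data.frame(
  ID   = c("01", "02", "03", "04", "05", "06"),
  var1 = c(NA, "a", "a", "a", "a", "a"),
  var2 = c(NA, NA, "b", "b", "b", "b"),
  var3 = c(NA, NA, NA, "c", "c", "c"),
  var4 = c(NA, NA, NA, NA, "d", "d"),
  var5 = c(NA, NA, NA, NA, NA, "e"),
  tar1 = c(NA, "a", "a", "a", "a", "a"),
  tar2 = c(NA, NA, "b", "b", "b", "b"),
  tar3 = c(NA, NA, NA, "c", "c", "c"),
  tar4 = c(NA, NA, NA, NA, "d", "d"),
  tar5 = c(NA, NA, NA, NA, NA, "e"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each row with at least one non-NA value.
has_value <- function(d) rowSums(!is.na(d)) > 0

# Rows with a value in both the var* group and the tar* group.
good <- has_value(df[startsWith(names(df), "var")]) &
  has_value(df[startsWith(names(df), "tar")])

result <- c(count = sum(good), proportion = mean(good))
result
```

On the question's data this should give a count of 5 and a proportion of about 0.83, matching the expected result. Using mean() on the logical vector avoids dividing by nrow(df) by hand.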