如果用行替换NAs意味着如果行式NA的比例低于某个阈值?

时间:2017-08-10 14:28:50

标签: r if-statement dplyr missing-data

为这个有点麻烦的问题道歉,但我目前正致力于心理健康研究。对于其中一个心理健康筛查工具,有15个变量,每个变量的值可以为0-3。然后通过获取这15个变量的总和来分配每行/参与者的总分。此工具的文档指出,如果缺少特定行/参与者的超过20%的值,则总得分也应视为缺失,但如果缺少少于20%的行,则每个应为缺失值指定该行剩余值的平均值。

我决定要这样做,我必须计算每个参与者的NA的比例,计算每个参与者排除NA的所有15个变量的平均值,然后使用检查是否的条件变异语句(或类似的东西)在找到每行的所有15个变量的总和之前,NAs的比例小于20%并且如果这样替换具有该行的平均值的相关列的NAs。除了这15个数据集之外,数据集还包含其他列,因此将函数应用于所有列将没有用。

为了计算没有NA的平均分数,我做了以下几点:

mental$somatic_mean <- rowMeans(mental [, c("var1", "var2", "var3", 
"var4", "var5", "var6", "var7", "var8", "var9", "var10", "var11", 
"var12","var13", "var14", "var15")], na.rm=TRUE)

计算每个变量的NA比例:

mental$somatic_na <- rowMeans(is.na(mental [, c("var1", "var2", 
"var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10", "var11", 
"var12", "var13", "var14", "var15")]))

然而,当我尝试使用mutate()语句来改变少于20%的值为NA的行时,我无法识别任何有效的代码。到目前为止,我已经尝试了很多排列,包括每个变量的以下内容:

mental_recode <- mental %>%
  rowwise() %>%
  mutate(var1 = if(somatic_na<0.2) 
  replace_na(list(var1= somatic_mean)))

返回:

"no applicable method for 'replace_na' applied to an object of class "list""

尝试在不使用mutate()的情况下一起完成它们:

mental %>%
  rowwise() %>%
  if(somatic_na<0.2)
                     replace_na(list(var1 = somatic_mean,   var2= 
somatic_mean,   var3 = somatic_mean,   var4 = somatic_mean,  var5 = 
somatic_mean,  var6 = somatic_mean,  var7 = somatic_mean, var8 = 
somatic_mean,  var9 = somatic_mean,  var10 = somatic_mean,   var11 = 
somatic_mean,  var12 = somatic_mean,   var13 = somatic_mean,  var14 = 
somatic_mean,  var15 = somatic_mean )) 

返回:

Error in if (.) somatic_na < 0.2 else replace_na(mental, list(var1 = somatic_mean,  : 
  argument is not interpretable as logical
In addition: Warning message:
In if (.) somatic_na < 0.2 else replace_na(mental, list(var1 = somatic_mean,  :
  the condition has length > 1 and only the first element will be used

我还尝试将if_else()与mutate()结合使用,如果条件不满足则将值设置为NA,但在各种排列和错误消息之后无法使其工作。

编辑:虚拟数据可以通过以下方式生成:

mental <- structure(list(id = 1:21, var1 = c(0L, 0L, 1L, 1L, 1L, 0L, 0L, 
                               NA, 0L, 0L, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 
0L, 0L, 0L), var2 = c(0L, 
 0L, 1L, 1L, 1L, 0L, 0L, 2L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 
2L, 0L, 1L, 1L), var3 = c(0L, 0L, 0L, 1L, 1L, 0L, 1L, 2L, 1L, 
1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 2L, 0L, 1L, 1L), var4 = c(1L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 
0L, 1L, 0L, 0L), var5 = c(0L, 0L, 0L, 1L, NA, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), var6 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), var7 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, NA, 0L), var8 = c(0L, 
0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), var9 = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), var10 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 
0L, 0L, NA, 0L), var11 = c(1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, NA, 0L), var12 = c(1L, 
0L, 1L, 1L, NA, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 1L, 1L), var13 = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 
0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, NA, 0L), var14 = c(1L, 
0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 
2L, 0L, 1L, 0L), var15 = c(1L, 0L, 2L, NA, NA, 0L, NA, 0L, 0L, 
0L, 0L, 0L, NA, NA, 0L, NA, NA, NA, NA, NA, 0L)), .Names = c("id", 
"var1", "var2", "var3", "var4", "var5", "var6", "var7", "var8", 
"var9", "var10", "var11", "var12", "var13", "var14", "var15"), class =                                 
"data.frame", row.names = c(NA, 
-21L))

有没有人知道适用于这种情况的代码?

提前致谢!

2 个答案:

答案 0 :(得分:1)

以下是使用您提供的数据框使用dplyr在一个链中完成所有操作的方法。

首先创建一个感兴趣的所有列名称的向量:

name_col <- colnames(mental)[2:16]

现在使用dplyr

library(dplyr)

mental %>% 
  # First create the column of row means
  mutate(somatic_mean = rowMeans(.[name_col], na.rm = TRUE)) %>% 
  # Now calculate the proportion of NAs
  mutate(somatic_na = rowMeans(is.na(.[name_col]))) %>% 
  # Create this column for filtering out later
  mutate(somatic_usable = ifelse(somatic_na < 0.2,
                                 "yes", "no")) %>% 
  # Make the following replacement on a row basis 
  rowwise() %>%
  mutate_at(vars(name_col), # Designate eligible columns to check for NAs
            funs(replace(., 
                         is.na(.) & somatic_na < 0.2, # Both conditions need to be met
                         somatic_mean))) %>% # What we are subbing the NAs with
  ungroup() # Now ungroup the 'rowwise' in case you need to modify further

现在,如果您只想选择NAs少于20%的条目,您可以将上述内容输入以下内容:

filter(somatic_usable == "yes")

另外值得注意的是,如果您想要使条件小于或等于 20%,则需要将somatic_na < 0.2替换为somatic_na <= 0.2。< / p>

希望这有帮助!

答案 1 :(得分:0)

这是一种仅使用基本R表达式并记住总和的数学属性的方法:

# generate fake data
set.seed(123)

dat <- data.frame(
  ID = 1:10,
  matrix(sample(c(0:3, NA), 10 * 15, TRUE), nrow = 10, ncol = 15),
  'another_var' = 'foo',
  'second_var' = 'bar',
  stringsAsFactors = FALSE
)

var_names <- paste0('X', 1:15)

# add number of NAs to data

dat$na_num <- rowSums(is.na(dat[var_names]))

# add row sum

dat$row_sum <- rowSums(dat[var_names], na.rm = TRUE) 

# add row mean

dat$row_mean <- rowMeans(dat[var_names], na.rm = TRUE)

# add final sum

dat$final_sum <- dat$row_sum + dat$row_mean * dat$na_num

# recode final sum to be NA if prop > .2

dat$final_sum <- ifelse(rowMeans(is.na(dat[var_names])) > .2,
                        NA,
                        dat$final_sum)

这是一个做同样事情的功能。在哪里指定data,然后指定变量名称的字符向量。

total_sum_calculation <- function(data, var_names){
  # add number of NAs to data
  na_num <- rowSums(is.na(data[var_names]))

  # add row sum
  row_sum <- rowSums(data[var_names], na.rm = TRUE) 

  # add row mean
  row_mean <- rowMeans(data[var_names], na.rm = TRUE)

  # add final sum
  final_sum <- row_sum + row_mean * na_num

  # recode final sum to be NA if prop > .2
  ifelse(rowMeans(is.na(data[var_names])) > .2,
                          NA,
                          final_sum)

}

v_names <- paste0('var', 1:15)
total_sum_calculation(data = mental, var_names = v_names)

 [1] 6.000000 0.000000 8.000000 7.500000       NA 0.000000 3.214286 9.230769 6.000000 2.000000 1.000000 0.000000 4.285714
[14]       NA 5.357143 5.357143 5.357143 9.642857 1.071429       NA 3.000000