Efficient way to compare values and generate a new column in R

Time: 2019-03-14 14:18:18

Tags: r dataframe

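I am trying to compare the values in each row against a set of valid values (stored in a separate list) and produce an error message whenever a row value does not match the valid values.

I am able to generate the desired output. However, I feel this is not an efficient way to do it at all.

My attempt: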

set.seed(1234)
dt <- data.frame(a_check=c(20,2,1,NA,0),
                 b_check=c(0,1,NA,1,15))    

valid_values <- list(a_check= c(1,2,3), b_check= c(0,1))
param_names <- colnames(dt)

error_msg <- list()
error <- list()
for(i in 1:nrow(dt)) {      
  for(j in 1:length(param_names)) {
    if(is.na(match(as.character(unlist(dt[param_names[j]]))[i], as.character(unlist(valid_values[j]))))) {
      error_msg[j] <- paste0(toupper(param_names[j]), " must be one of the following values ", paste(unlist(valid_values[j]), collapse = '-'))

    } else {
      error_msg[j] <- NA
    }
  }
  error[i] <- paste(unlist(error_msg), collapse = " & ")
}

final_error <- unlist(error)
dt$error <- final_error

My output:

> dt
  a_check b_check                                                                                               error
1      20       0                                              A_CHECK must be one of the following values 1-2-3 & NA
2       2       1                                                                                             NA & NA
3       1      NA                                                NA & B_CHECK must be one of the following values 0-1
4      NA       1                                              A_CHECK must be one of the following values 1-2-3 & NA
5       0      15 A_CHECK must be one of the following values 1-2-3 & B_CHECK must be one of the following values 0-1

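Note: I do get what I want, but I don't want "NA & NA" or "NA &" in the output. With 2 variables this is easy to fix. However, I have more than 500 variables.

4 answers:

Answer 0: (score: 2)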

Add a check column to your df and use ifelse() with %in% to get the TRUE|FALSE result (see ?match)...
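As a minimal sketch of that suggestion for a single column (the column name a_error and the message text are assumptions made for illustration, not part of the answer):

```r
dt <- data.frame(a_check = c(20, 2, 1, NA, 0),
                 b_check = c(0, 1, NA, 1, 15))
valid_values <- list(a_check = c(1, 2, 3), b_check = c(0, 1))

# %in% is built on match() and never returns NA:
# a missing value is simply reported as "not in the valid set"
dt$a_error <- ifelse(dt$a_check %in% valid_values$a_check,
                     "",
                     "A_CHECK is not a valid value")
```

Because %in% returns FALSE rather than NA for missing values, no separate is.na() handling is needed here.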

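I like @Jav's answer. If you add a reshape on top of it (more precisely, before it), you can put all the information in two columns, merge (i.e. join) with an error lookup table, and then reshape back to wide.

Example reshape: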

dt_long <- reshape(data = dt,  times = names(dt),
               direction = 'long', timevar = "type", 
               varying = list(names(dt)), idvar = "id", v.names = "values")
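The reshape, merge, and collapse idea could be sketched roughly as follows in base R. This is only an illustration under stated assumptions, not the answerer's code: the lookup table, its valid flag, and the wrong_cols column are hypothetical names, and the long table is built by hand instead of with reshape() so that each step stays visible.

```r
dt <- data.frame(a_check = c(20, 2, 1, NA, 0),
                 b_check = c(0, 1, NA, 1, 15))
valid_values <- list(a_check = c(1, 2, 3), b_check = c(0, 1))

# Long format: one row per (row id, column name, value)
dt_long <- data.frame(id     = rep(seq_len(nrow(dt)), times = ncol(dt)),
                      type   = rep(names(dt), each = nrow(dt)),
                      values = unlist(dt, use.names = FALSE))

# Hypothetical lookup table: one row per allowed (column name, value) pair
lookup <- data.frame(type   = rep(names(valid_values), lengths(valid_values)),
                     values = unlist(valid_values, use.names = FALSE),
                     valid  = TRUE)

# Left join: pairs without a match in the lookup end up with valid = NA
checked <- merge(dt_long, lookup, by = c("type", "values"), all.x = TRUE)
bad <- checked[is.na(checked$valid), ]

# Collapse the offending column names per original row
msgs <- tapply(bad$type, bad$id,
               function(cols) paste(sort(cols), collapse = " & "))
dt$wrong_cols <- ""
dt$wrong_cols[as.integer(names(msgs))] <- as.vector(msgs)
```

With the example data, row 2 keeps an empty string and row 5 lists both columns. Whether the join pays off compared with a column-wise %in% check would need to be measured on the real 500-column data.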

Answer 1: (score: 1)

With data.table you can do this in a more vectorized way, looping over columns rather than rows:

library(data.table)

dt <- as.data.table(dt)

# For each parameter, flag whether its values are among the valid ones
dt[, paste0(param_names, "_test") := lapply(param_names, function(x) {
    get(x, dt) %in% get(x, valid_values)
})]


   a_check b_check a_check_test b_check_test
1:      20       0        FALSE         TRUE
2:       2       1         TRUE         TRUE
3:       1      NA         TRUE        FALSE
4:      NA       1        FALSE         TRUE
5:       0      15        FALSE        FALSE

Edit: assigning the result to a single column:

library(magrittr)

dt[,  wrong_cols := lapply(param_names, function(x) {
    (!(get(x, dt) %in% get(x, valid_values))) %>%
      ifelse(., x, "")
  }) %>% Reduce(paste, .)]

> dt
   a_check b_check      wrong_cols
1:      20       0        a_check 
2:       2       1                
3:       1      NA         b_check
4:      NA       1        a_check 
5:       0      15 a_check b_check

Edit 2:

dt[, error := lapply(param_names, function(x) {
  ((get(x, dt) %in% get(x, valid_values))) %>%
    ifelse(., " ", paste(x, "should have valid values like -", paste(get(x, valid_values), collapse = " ")))
}) %>% Reduce(paste, .)]

> dt
   a_check b_check                                                                                     error
1:      20       0                                            a_check should have valid values like - 1 2 3 
2:       2       1                                                                                          
3:       1      NA                                               b_check should have valid values like - 0 1
4:      NA       1                                            a_check should have valid values like - 1 2 3 
5:       0      15 a_check should have valid values like - 1 2 3 b_check should have valid values like - 0 1

Answer 2: (score: 1)

This works too. It is a bit more concise/efficient. I can check it with microbenchmark later, but it looks like your problem is solved.

dt <- data.frame(a_check=c(20,2,1,NA,0),
                 b_check=c(0,1,NA,1,15))

valid_values <- list(a_check= c(1,2,3), b_check= c(0,1))


dt_errors <- sapply(1:ncol(dt), function(x) ifelse(!dt[[x]] %in% valid_values[[x]],
                                                   paste0(toupper(names(dt)[x]), 
                                                          " must be one of the following values: ", 
                                                          paste(valid_values[[x]], collapse = ", ")), 
                                                   ""))

dt$error <- apply(dt_errors, 1 , paste, collapse = " & ")
dt$error <- trimws(gsub("^ &|& $", "", dt$error))
dt
  a_check b_check                                                                                                    error
1      20       0                                                     A_CHECK must be one of the following values: 1, 2, 3
2       2       1                                                                                                         
3       1      NA                                                        B_CHECK must be one of the following values: 0, 1
4      NA       1                                                     A_CHECK must be one of the following values: 1, 2, 3
5       0      15 A_CHECK must be one of the following values: 1, 2, 3 & B_CHECK must be one of the following values: 0, 1

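Edit: actually, with more than two variables you may have to adjust the regex pattern to remove the extra "&" separators. Otherwise, it should scale well.

Adding another gsub statement should do the trick (in theory).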

dt$error <- apply(dt_errors, 1 , paste, collapse = " & ")    
dt$error <- gsub("( & )\\1+", "\\1", dt$error)
dt$error <- gsub("^ & | & $", "", dt$error)
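An alternative that sidesteps the regex cleanup is to drop the empty messages in each row before joining them. The sketch below is one possible variant, not the answerer's code, and it assumes a hypothetical third column c_check to show the behaviour with more than two variables:

```r
dt <- data.frame(a_check = c(20, 2, 1, NA, 0),
                 b_check = c(0, 1, NA, 1, 15),
                 c_check = c(5, 7, 5, 5, 9))   # hypothetical third column
valid_values <- list(a_check = c(1, 2, 3), b_check = c(0, 1), c_check = 5)

# One column of messages per variable; "" where the value is valid
dt_errors <- sapply(names(valid_values), function(col)
  ifelse(dt[[col]] %in% valid_values[[col]],
         "",
         paste0(toupper(col), " must be one of the following values: ",
                paste(valid_values[[col]], collapse = ", "))))

# Keep only the non-empty messages in each row, then join them
dt$error <- apply(dt_errors, 1,
                  function(msgs) paste(msgs[nzchar(msgs)], collapse = " & "))
```

Since empty strings never reach paste(), no leading, trailing, or doubled "&" can appear, regardless of how many variables are checked.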

Answer 3: (score: 0)

library(purrr)
library(stringr)

compose_err_msg <- function(col)
  paste(toupper(col), 
        "must be one of the following values", 
        paste(valid_values[[col]], collapse = "-"))

dt$error <- 
  dt %>% 
  imap(~ ifelse(
    .x %in% valid_values[[.y]],
    list(character(0)),
    list(compose_err_msg(.y))
  )) %>% 
  transpose() %>% 
  # Note: purrr::lift() is deprecated as of purrr 1.0.0
  map(lift(str_c, sep = " & ")) %>% 
  map_chr(~ if (identical(., character(0))) "" else .)

#   a_check b_check                                                                                               error
# 1      20       0                                                   A_CHECK must be one of the following values 1-2-3
# 2       2       1                                                                                                    
# 3       1      NA                                                     B_CHECK must be one of the following values 0-1
# 4      NA       1                                                   A_CHECK must be one of the following values 1-2-3
# 5       0      15 A_CHECK must be one of the following values 1-2-3 & B_CHECK must be one of the following values 0-1

Note that I am not claiming this is a more efficient or simpler approach. There is clearly a lot going on here.

The key is imap(), which loops over the columns (.x) and their names (.y) at the same time.
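For readers without purrr, the same "value plus name" iteration can be approximated in base R with Map(). This is a rough analogue offered for illustration, not what the answer uses; the variable name checks is an assumption:

```r
dt <- data.frame(a_check = c(20, 2, 1, NA, 0),
                 b_check = c(0, 1, NA, 1, 15))
valid_values <- list(a_check = c(1, 2, 3), b_check = c(0, 1))

# Map() walks the columns and their names in parallel,
# playing the roles of .x and .y in imap()
checks <- Map(function(values, name) values %in% valid_values[[name]],
              dt, names(dt))
```

The result is a named list with one logical vector per column, which can then be turned into messages in the same way as in the answers above.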

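The less important part is using stringr::str_c instead of paste to satisfy the no-"NA & NA" constraint. This adds extra complexity: it requires working with character(0) and replacing it with "" at the end.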