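I am trying to compare the values in each row against a set of valid values (kept in a separate list) and produce an error message whenever a row value does not match any valid value.
I am able to generate the desired output. However, I feel this is not an efficient way to do it at all.
My attempt: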
set.seed(1234)
dt <- data.frame(a_check=c(20,2,1,NA,0),
b_check=c(0,1,NA,1,15))
valid_values <- list(a_check= c(1,2,3), b_check= c(0,1))
param_names <- colnames(dt)
error_msg <- list()
error <- list()
for(i in 1:nrow(dt)) {
for(j in 1:length(param_names)) {
if(is.na(match(as.character(unlist(dt[param_names[j]]))[i], as.character(unlist(valid_values[j]))))) {
error_msg[j] <- paste0(toupper(param_names[j]), " must be one of the following values ", paste(unlist(valid_values[j]), collapse = '-'))
} else {
error_msg[j] <- NA
}
}
error[i] <- paste(unlist(error_msg), collapse = " & ")
}
final_error <- unlist(error)
dt$error <- final_error
My output:
> dt
a_check b_check error
1 20 0 A_CHECK must be one of the following values 1-2-3 & NA
2 2 1 NA & NA
3 1 NA NA & B_CHECK must be one of the following values 0-1
4 NA 1 A_CHECK must be one of the following values 1-2-3 & NA
5 0 15 A_CHECK must be one of the following values 1-2-3 & B_CHECK must be one of the following values 0-1
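Note: this does give me what I want, except that I do not want the "NA & NA" or the "NA &" parts. With 2 variables that is easy to clean up. However, I have more than 500 variables.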
Answer 0 (score: 2)
Add a check column to the df by applying ifelse to the TRUE|FALSE result of %in% (see ?match)...
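As a minimal sketch of that idea for a single column (the column names a_ok and a_msg and the message text are made up for illustration, not taken from the answer):

```r
dt <- data.frame(a_check = c(20, 2, 1, NA, 0),
                 b_check = c(0, 1, NA, 1, 15))
valid_values <- list(a_check = c(1, 2, 3), b_check = c(0, 1))

# %in% always returns TRUE or FALSE, never NA: an NA in the data
# yields FALSE here because NA is not among the valid values.
dt$a_ok <- dt$a_check %in% valid_values$a_check

# ifelse() maps the logical vector to a message or an empty string.
dt$a_msg <- ifelse(dt$a_ok, "", "A_CHECK is invalid")
```

Because %in% never returns NA, there is no need for the is.na(match(...)) test used in the question.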
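I like @Jav's answer. If you just add a reshape on top of it (more precisely, before it), you can put all the information into two columns, merge (i.e. join) it with an error lookup table, and then reshape it back to wide.
Example reshape: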
dt_long <- reshape(data = dt, times = names(dt),
direction = 'long', timevar = "type",
varying = list(names(dt)), idvar = "id", v.names = "values")
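The rest of that pipeline might look roughly like the sketch below, in base R only. The lookup table error_lookup and the helper columns ok and msg are hypothetical names. Instead of reshaping back to wide, this sketch collapses the messages per original row with tapply, which is a simplification of what the answer describes.

```r
dt <- data.frame(a_check = c(20, 2, 1, NA, 0),
                 b_check = c(0, 1, NA, 1, 15))
valid_values <- list(a_check = c(1, 2, 3), b_check = c(0, 1))

# Long format, same reshape call as above: columns type, values, id.
dt_long <- reshape(data = dt, times = names(dt),
                   direction = "long", timevar = "type",
                   varying = list(names(dt)), idvar = "id", v.names = "values")

# Hypothetical lookup table: one error message per variable.
error_lookup <- data.frame(
  type = names(valid_values),
  msg  = vapply(names(valid_values), function(nm) {
    paste(toupper(nm), "must be one of the following values",
          paste(valid_values[[nm]], collapse = "-"))
  }, character(1)),
  stringsAsFactors = FALSE
)

# Flag each long row as valid or not (NA is never in the valid set).
dt_long$ok <- mapply(function(v, nm) v %in% valid_values[[nm]],
                     dt_long$values, dt_long$type)

# Join only the invalid rows to their message, then collapse per original row.
bad <- merge(dt_long[!dt_long$ok, ], error_lookup, by = "type")
err <- tapply(bad$msg, bad$id, paste, collapse = " & ")

dt$error <- ""
dt$error[as.integer(names(err))] <- as.vector(err)
```

Since merge sorts by the join key, the messages within a row come out ordered by variable name rather than by column position. The sketch also assumes at least one invalid value exists.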
Answer 1 (score: 1)
With data.table you can do this in a more vectorised way. Loop over the columns rather than the rows:
> library(data.table)
> dt <- as.data.table(dt)
> dt[, paste0(param_names, "_test") := lapply(param_names, function(x){
get(x, dt) %in% get(x, valid_values)
})]
a_check b_check a_check_test b_check_test
1: 20 0 FALSE TRUE
2: 2 1 TRUE TRUE
3: 1 NA TRUE FALSE
4: NA 1 FALSE TRUE
5: 0 15 FALSE FALSE
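For readers without data.table, the same column-wise check can be sketched in base R with Map. This is an equivalent written for illustration, not part of the answer itself; it assumes valid_values has one entry per column name.

```r
dt <- data.frame(a_check = c(20, 2, 1, NA, 0),
                 b_check = c(0, 1, NA, 1, 15))
valid_values <- list(a_check = c(1, 2, 3), b_check = c(0, 1))
param_names <- colnames(dt)

# One vectorised %in% per column; indexing both lists by name keeps
# each column paired with its own set of valid values.
tests <- Map(function(col, ok) col %in% ok,
             dt[param_names], valid_values[param_names])

# Attach the logical results as new *_test columns.
dt[paste0(param_names, "_test")] <- tests
```

With 500 columns this performs 500 vectorised comparisons instead of one comparison per cell.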
Edit: to put the result into a single column:
library(magrittr)
dt[, wrong_cols := lapply(param_names, function(x) {
(!(get(x, dt) %in% get(x, valid_values))) %>%
ifelse(., x, "")
}) %>% Reduce(paste, .)]
> dt
a_check b_check wrong_cols
1: 20 0 a_check
2: 2 1
3: 1 NA b_check
4: NA 1 a_check
5: 0 15 a_check b_check
EDIT_2
dt[, error := lapply(param_names, function(x) {
((get(x, dt) %in% get(x, valid_values))) %>%
ifelse(., " ", paste(x, "should have valid values like -", paste(get(x, valid_values), collapse = " ")))
}) %>% Reduce(paste, .)]
> dt
a_check b_check error
1: 20 0 a_check should have valid values like - 1 2 3
2: 2 1
3: 1 NA b_check should have valid values like - 0 1
4: NA 1 a_check should have valid values like - 1 2 3
5: 0 15 a_check should have valid values like - 1 2 3 b_check should have valid values like - 0 1
Answer 2 (score: 1)
This works too. It is more concise and should be more efficient. I can check it with microbenchmark later, but it looks like your problem is solved.
dt <- data.frame(a_check=c(20,2,1,NA,0),
b_check=c(0,1,NA,1,15))
valid_values <- list(a_check= c(1,2,3), b_check= c(0,1))
dt_errors <- sapply(1:ncol(dt), function(x) ifelse(!dt[[x]] %in% valid_values[[x]],
paste0(toupper(names(dt)[x]),
" must be one of the following values: ",
paste(valid_values[[x]], collapse = ", ")),
""))
dt$error <- apply(dt_errors, 1 , paste, collapse = " & ")
dt$error <- trimws(gsub("^ &|& $", "", dt$error))
dt
a_check b_check error
1 20 0 A_CHECK must be one of the following values: 1, 2, 3
2 2 1
3 1 NA B_CHECK must be one of the following values: 0, 1
4 NA 1 A_CHECK must be one of the following values: 1, 2, 3
5 0 15 A_CHECK must be one of the following values: 1, 2, 3 & B_CHECK must be one of the following values: 0, 1
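Edit: actually, with more than two variables the regex pattern may need adjusting to remove the surplus &. Apart from that it should scale well.
Adding another gsub statement should do the trick (in theory).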
dt$error <- apply(dt_errors, 1 , paste, collapse = " & ")
dt$error <- gsub("( & )\\1+", "\\1", dt$error)
dt$error <- gsub("^ & | & $", "", dt$error)
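One way to sidestep the regex cleanup altogether, sketched here as an alternative rather than as the answer's own method, is to mark valid cells as NA and drop them before collapsing. The name msgs is made up for illustration.

```r
dt <- data.frame(a_check = c(20, 2, 1, NA, 0),
                 b_check = c(0, 1, NA, 1, 15))
valid_values <- list(a_check = c(1, 2, 3), b_check = c(0, 1))

# Character matrix: one row per data row, one column per variable.
# Valid cells hold NA, invalid cells hold the message.
msgs <- do.call(cbind, lapply(names(dt), function(nm) {
  ifelse(dt[[nm]] %in% valid_values[[nm]],
         NA_character_,
         paste0(toupper(nm), " must be one of the following values: ",
                paste(valid_values[[nm]], collapse = ", ")))
}))

# Collapse only the non-NA messages, so no stray separators appear.
dt$error <- apply(msgs, 1, function(row) {
  paste(row[!is.na(row)], collapse = " & ")
})
```

Because empty entries are removed before paste runs, the result does not depend on how many variables there are.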
Answer 3 (score: 0)
library(purrr)
library(stringr)
compose_err_msg <- function(col)
paste(toupper(col),
"must be one of the following values",
paste(valid_values[[col]], collapse = "-"))
dt$error <-
dt %>%
imap(~ ifelse(
.x %in% valid_values[[.y]],
list(character(0)),
list(compose_err_msg(.y))
)) %>%
transpose() %>%
map(lift(str_c, sep = " & ")) %>%
map_chr(~ if (identical(., character(0))) "" else .)
# a_check b_check error
# 1 20 0 A_CHECK must be one of the following values 1-2-3
# 2 2 1
# 3 1 NA B_CHECK must be one of the following values 0-1
# 4 NA 1 A_CHECK must be one of the following values 1-2-3
# 5 0 15 A_CHECK must be one of the following values 1-2-3 & B_CHECK must be one of the following values 0-1
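Note that I am not claiming this is a more efficient or simpler approach. There is clearly a lot going on here.
The key is imap(), which loops over the columns (.x) and their names (.y) at the same time.
The less important part is using stringr::str_c instead of paste to satisfy the no "NA & NA" constraint. That adds extra complexity: it requires using character(0) and replacing it with "" at the end.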
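To see why paste alone runs into the "NA & NA" problem, here is a small base R illustration. It does not use stringr, and the variable names are made up.

```r
x <- c(NA, "B_CHECK must be one of the following values 0-1")

# paste() converts NA to the literal string "NA" before joining.
with_na <- paste(x, collapse = " & ")

# Dropping the NA entries first avoids the stray "NA & " prefix.
without_na <- paste(x[!is.na(x)], collapse = " & ")

# Collapsing an empty character vector yields the empty string.
empty <- paste(character(0), collapse = " & ")
```

Here with_na starts with "NA & ", while without_na contains only the real message and empty is "".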