After combining several columns with tidyr::unite(), the NAs from missing data remain in my character vector, which I do not want.
I have a series of medical diagnoses per row (one per column) and would like to benchmark searching for a series of codes via grepl() and %in%.
There is an open issue on GitHub. Has there been any movement on it, or is there a workaround? I would like to keep the vector comma-separated.
Here is a representative example:
library(dplyr)
library(tidyr)
df <- data_frame(a = paste0("A.", rep(1, 3)), b = " ", c = c("C.1", "C.3", " "), d = "D.4", e = "E.5")
cols <- letters[2:4]
df[, cols] <- gsub(" ", NA_character_, as.matrix(df[, cols]))
tidyr::unite(df, new, cols, sep = ",")
Current output:
# # A tibble: 3 x 3
#   a     new        e
#   <chr> <chr>      <chr>
# 1 A.1   NA,C.1,D.4 E.5
# 2 A.1   NA,C.3,D.4 E.5
# 3 A.1   NA,NA,D.4  E.5
Desired output:
# # A tibble: 3 x 3
#   a     new     e
#   <chr> <chr>   <chr>
# 1 A.1   C.1,D.4 E.5
# 2 A.1   C.3,D.4 E.5
# 3 A.1   D.4     E.5
Answer 0 (score: 4)
After the NAs have been created, you can remove them with a regular expression:
library(dplyr)
library(tidyr)
df <- data_frame(a = paste0("A.", rep(1, 3)),
                 b = " ",
                 c = c("C.1", "C.3", " "),
                 d = "D.4", e = "E.5")
cols <- letters[2:4]
df[, cols] <- gsub(" ", NA_character_, as.matrix(df[, cols]))
tidyr::unite(df, new, cols, sep = ",") %>%
  dplyr::mutate(new = stringr::str_replace_all(new, 'NA,?', '')) # New line
Output:
# A tibble: 3 x 3
  a     new     e
  <chr> <chr>   <chr>
1 A.1   C.1,D.4 E.5
2 A.1   C.3,D.4 E.5
3 A.1   D.4     E.5
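One caveat that may be worth checking on real data: the pattern 'NA,?' leaves a trailing comma when the missing value is the last united element, and it also strips the letters NA from inside genuine values. The following is a minimal base R sketch of a stricter cleanup, not part of the original answer. The sample strings and the helper name drop_na_tokens are made up for illustration, and it assumes that no real value is literally "NA".

```r
# Hypothetical united strings: leading NA, repeated NA, trailing NA,
# only NAs, and a value that merely starts with the letters "NA".
x <- c("NA,C.1,D.4", "NA,NA,D.4", "C.1,NA", "NA,NA", "NAB.1,NA,D.4")

# Hypothetical helper: drop NA only when it is a whole comma-delimited token.
drop_na_tokens <- function(s) {
  # Remove NA when it is bounded by a comma or the string edge on both sides.
  s <- gsub("(?<![^,])NA(?![^,])", "", s, perl = TRUE)
  # Collapse the runs of commas left behind by removed tokens.
  s <- gsub(",+", ",", s)
  # Strip a leading or trailing comma.
  gsub("^,|,$", "", s)
}

cleaned <- drop_na_tokens(x)
print(cleaned)

# For comparison, the simpler pattern on the two problematic inputs.
simple <- gsub("NA,?", "", c("C.1,NA", "NAB.1,NA,D.4"))
print(simple)
```

If the diagnosis columns can never end in a missing value and no code contains NA, the shorter pattern is sufficient and cheaper.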
Answer 1 (score: 3)
You can avoid inserting them in the first place by iterating over the rows:
library(tidyverse)
df <- data_frame(
  a = c("A.1", "A.1", "A.1"),
  b = c(NA_character_, NA_character_, NA_character_),
  c = c("C.1", "C.3", NA),
  d = c("D.4", "D.4", "D.4"),
  e = c("E.5", "E.5", "E.5")
)
cols <- letters[2:4]
df %>% mutate(x = pmap_chr(.[cols], ~paste(na.omit(c(...)), collapse = ',')))
#> # A tibble: 3 x 6
#>   a     b     c     d     e     x
#>   <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 A.1   <NA>  C.1   D.4   E.5   C.1,D.4
#> 2 A.1   <NA>  C.3   D.4   E.5   C.3,D.4
#> 3 A.1   <NA>  <NA>  D.4   E.5   D.4
Or, using the stringi package that underlies tidyr:
df %>% mutate(x = pmap_chr(.[cols], ~stringi::stri_flatten(
  c(...), collapse = ",",
  na_empty = TRUE, omit_empty = TRUE
)))
#> # A tibble: 3 x 6
#>   a     b     c     d     e     x
#>   <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 A.1   <NA>  C.1   D.4   E.5   C.1,D.4
#> 2 A.1   <NA>  C.3   D.4   E.5   C.3,D.4
#> 3 A.1   <NA>  <NA>  D.4   E.5   D.4
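The catch is that iterating over rows typically requires a lot of calls, so it can be slow at scale. Unfortunately, there does not seem to be a good vectorized alternative for dropping the NAs before the strings are joined.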
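Answer 2 (score: 3)
If you install the development version of tidyr, you can now add the na.rm argument to remove the NAs. The issue is now closed. (The argument has since shipped on CRAN in tidyr 1.0.0, so the development install is only needed for older versions.)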
devtools::install_github("tidyverse/tidyr")
library(tidyr)
df %>% unite(new, cols, sep = ",", na.rm = TRUE)
#  a     new     e
#  <chr> <chr>   <chr>
#1 A.1   C.1,D.4 E.5
#2 A.1   C.3,D.4 E.5
#3 A.1   D.4     E.5
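You can also use a base R apply approach. Note that toString() joins the items with ", " (comma plus space) rather than ",".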
apply(df[cols], 1, function(x) toString(na.omit(x)))
#[1] "C.1, D.4" "C.3, D.4" "D.4"
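If the comma-only separator from the question is required, paste() with collapse = "," can stand in for toString(). The sketch below is a small variation on the line above rather than the answerer's own code. It rebuilds the sample data with base data.frame() only so that it runs without any packages.

```r
# Same sample data as in the Data block, built with base R only.
df <- data.frame(
  a = c("A.1", "A.1", "A.1"),
  b = c(NA_character_, NA_character_, NA_character_),
  c = c("C.1", "C.3", NA),
  d = c("D.4", "D.4", "D.4"),
  e = c("E.5", "E.5", "E.5"),
  stringsAsFactors = FALSE
)
cols <- letters[2:4]

# Drop the NAs in each row, then join what is left with a bare comma.
united <- apply(df[cols], 1, function(x) paste(na.omit(x), collapse = ","))
united <- unname(united)
print(united)
```

Like the toString() version, this still calls the function once per row, so the scaling concern raised in Answer 1 applies here as well.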
Data
df <- data_frame(
  a = c("A.1", "A.1", "A.1"),
  b = c(NA_character_, NA_character_, NA_character_),
  c = c("C.1", "C.3", NA),
  d = c("D.4", "D.4", "D.4"),
  e = c("E.5", "E.5", "E.5")
)
cols <- letters[2:4]
Answer 3 (score: 2)
Thanks. I have put together a summary of the solutions and benchmarked them on my data:
library(microbenchmark)
library(dplyr)
library(purrr) # provides pmap_chr(), used in the row-iteration variants below
library(stringr)
library(tidyr)
library(biometrics) # has my helper function for column selection
cols <- biometrics::variables(c("diagnosis", "dagger", "ediag"), 20)
system.time({
  df <- dat[, cols]
  df <- gsub(" ", NA_character_, as.matrix(df)) %>% tbl_df()
})
microbenchmark(
  ## search by base R `match()` function
  match_spaces = apply(dat, 1, function(x) any(c("A37.0","A37.1","A37.8","A37.9") %in% x[cols])), # original search (match)
  match_NAs = apply(df, 1, function(x) any(c("A37.0","A37.1","A37.8","A37.9") %in% x[cols])), # matching with " " replaced by NAs with gsub
  ## search by base R 'grep()' function - the same regex is used in each case
  regex_str_replace_all = tidyr::unite(df, new, cols, sep = ",") %>% # grepl search with NAs removed with `stringr::str_replace_all()`
    mutate(new = str_replace_all(new, "NA,?", "")) %>%
    apply(1, function(x) grepl("A37.*", x, ignore.case = TRUE)),
  regex_toString = tidyr::unite(df, new, cols, sep = ",") %>% # grepl search with NAs removed with `apply()` & `toString()`
    mutate(new = apply(df[cols], 1, function(x) toString(na.omit(x)))) %>%
    apply(1, function(x) grepl("A37.*", x, ignore.case = TRUE)),
  regex_row_iteration = df %>% # grepl search after iterating over rows (using syntax I'm not familiar with and need to learn!)
    mutate(new = pmap_chr(.[cols], ~paste(na.omit(c(...)), collapse = ','))) %>%
    select(new) %>%
    apply(1, function(x) grepl("A37.*", x, ignore.case = TRUE)),
  regex_stringi = df %>% mutate(new = pmap_chr(.[cols], ~stringi::stri_flatten( # grepl after stringi
      c(...), collapse = ",",
      na_empty = TRUE, omit_empty = TRUE
    ))) %>%
    select(new) %>%
    apply(1, function(x) grepl("A37.*", x, ignore.case = TRUE)),
  times = 10L
)
# Unit: milliseconds
#                  expr        min        lq      mean    median        uq       max neval
#          match_spaces 14820.2076 15060.045 15558.092 15573.885 15901.015 16521.855    10
#             match_NAs   998.3184  1061.973  1191.691  1203.849  1301.511  1378.314    10
# regex_str_replace_all  1464.4502  1487.473  1637.832  1596.522  1701.718  2114.055    10
#        regex_toString  4324.0914  4341.725  4631.998  4487.373  4977.603  5439.026    10
#   regex_row_iteration  5794.5994  6107.475  6458.339  6436.273  6720.185  7256.980    10
#         regex_stringi  4772.3859  5267.456  5466.510  5436.804  5806.272  6011.713    10
After replacing the blanks (" ") with NAs, %in% looks like the winner. If I do use a regex, removing the NAs with stringr::str_replace_all() is the fastest.
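When reading these timings, it may help to keep in mind that the two search styles are not strictly equivalent: %in% tests for the exact listed codes, whereas the regex "A37.*" matches the text A37 anywhere in the united string. The toy sketch below uses made-up rows (not the author's data, and "ZA37.5" is a deliberately artificial code) to show where the two can disagree, together with one possible token-anchored pattern that agrees with %in% on this sample.

```r
# Made-up diagnosis rows; NA marks an empty slot.
diag <- rbind(
  c("A37.0",  NA,      "J45.9"),
  c("B01.9",  "A37.9", NA),
  c("J45.9",  NA,      NA),
  c("ZA37.5", NA,      NA)  # artificial code that only contains "A37"
)
codes <- c("A37.0", "A37.1", "A37.8", "A37.9")

# Exact matching row by row, in the style of match_NAs.
hit_in <- apply(diag, 1, function(x) any(codes %in% x))

# Regex matching on the comma-joined row, in the style of the regex_* variants.
joined <- apply(diag, 1, function(x) paste(na.omit(x), collapse = ","))
hit_regex <- grepl("A37.*", joined, ignore.case = TRUE)

# A pattern anchored to whole comma-delimited tokens for the four listed codes.
hit_anchored <- grepl("(^|,)A37\\.[0189](,|$)", joined)

print(data.frame(joined, hit_in, hit_regex, hit_anchored))
```

Whether the looser regex is acceptable depends on the coding scheme in the real data, so this is only a consistency check to run before comparing speeds.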
Answer 4 (score: 0)
Removing them inside the unite call itself can cause some errors. Instead, I remove them from the column after the fact.
df <- data_frame(a = paste0("A.", rep(1, 3)), b = " ", c = c("C.1", "C.3", " "), d = "D.4", e = "E.5")
cols <- letters[2:4]
df[, cols] <- gsub(" ", NA_character_, as.matrix(df[, cols]))
df <- tidyr::unite(df, new, cols, sep = ",")
df$new <- gsub("NA,","",df$new)