Creating a new variable based on NA values across all columns

Time: 2021-03-05 12:17:44

Tags: r tibble

I have a tibble like this:

dat = tibble(a1 = c(23, NA, 3, 0, NA),
             a2 = c(NA, 6, 0, 9, NA),
             a3 = c(NA, NA, "censored", "censored", NA),
             a4 = c(NA, "censored", NA, NA, NA))

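I want to create a new variable named "class" that satisfies the following conditions:

  • If either a1 or a2 holds a number not equal to 0, then class = "yes",
  • If all variables whose names start with the letter "a" are NA, then class = "no",
  • Otherwise, class = "censored" (that is, if only one of these columns contains "censored", then class = "censored")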

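3 answers:

Answer 0: (score: 1)

Here is an attempt at an example using only base R. I am not sure I understood all of the conditions correctly.

I believe there may be better solutions using dplyr or data.table, but I don't know your preference.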

library(tibble)

# create data
dat  = tibble(
  a1 = c(23, NA, 3, 0, NA),
  a2 = c(NA, 6, 0, 9, NA),
  a3 = c(NA, NA, "censored", "censored", NA),
  a4 = c(NA, "censored", NA, NA, NA)
)

# 1. if either a1 or a2 has the number not equal to 0, then class = "yes" ####

dat$class <- ifelse(dat$a1 != 0 | dat$a2 != 0, 'yes', NA)

# 2. if all variables that start with letter "a" equal to NA, then class = "no" ####

# logical index of the columns whose names start with "a"
a_cols <- startsWith(names(dat), "a")

# check if all "a" cols are NA and apply "no" to dat$class
# achieved by comparing the row sum of NA cells with the number of "a" cols
dat$class <-
  ifelse(rowSums(is.na(dat[, a_cols])) == sum(a_cols), 'no', dat$class)


# 3. other else, class = "censored" (only one of these columns has "censored", then class = "censored") ####

# check if the "a" cols contain "censored" and apply "censored" to dat$class
# achieved by checking for row sum > 0 matching the condition of == "censored"

dat$class <-
  ifelse(rowSums(dat[, a_cols] == "censored", na.rm = TRUE) > 0,
         "censored",
         dat$class)

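In this example the columns starting with "a" could also be accessed by index as dat[,1:4], but your actual data may look different.

Update

An example based on the solution given earlier by @NarimeneL. Note that the order of the case_when statements matters here!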

library(tibble)
library(dplyr)
library(magrittr)
library(tidyselect)


# create data
dat  = tibble(
  a1 = c(23, NA, 3, 0, NA),
  a2 = c(NA, 6, 0, 9, NA),
  a3 = c(NA, NA, "censored", "censored", NA),
  a4 = c(NA, "censored", NA, NA, NA)
)


dat2 <- dat %>% select(starts_with("a")) %>%
  mutate(class = case_when(
    rowSums(. == "censored", na.rm = TRUE) > 0 ~ "censored" ,
    a1 != 0 ~ "yes",
    a2 != 0 ~ "yes",
    rowSums(is.na(.)) == ncol(.) ~ 'no'
  ))
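
To illustrate why the order of the case_when branches matters, here is a minimal sketch on the same example data. The helper vectors has_censored, is_yes and all_na are names introduced here for illustration only; they are not part of the answer above.

```r
library(tibble)
library(dplyr)

dat <- tibble(
  a1 = c(23, NA, 3, 0, NA),
  a2 = c(NA, 6, 0, 9, NA),
  a3 = c(NA, NA, "censored", "censored", NA),
  a4 = c(NA, "censored", NA, NA, NA)
)

# Hypothetical helper flags, one value per row
has_censored <- unname(rowSums(dat == "censored", na.rm = TRUE) > 0)
is_yes       <- dat$a1 != 0 | dat$a2 != 0
all_na       <- unname(rowSums(is.na(dat)) == ncol(dat))

# "censored" is tested first, so it wins over "yes"
censored_first <- case_when(
  has_censored ~ "censored",
  is_yes       ~ "yes",
  all_na       ~ "no"
)

# "yes" is tested first, so it wins over "censored"
yes_first <- case_when(
  is_yes       ~ "yes",
  has_censored ~ "censored",
  all_na       ~ "no"
)

censored_first
# [1] "yes"      "censored" "censored" "censored" "no"
yes_first
# [1] "yes" "yes" "yes" "yes" "no"
```

case_when uses the first branch whose condition is TRUE for each row, so rows 2 to 4, which satisfy both the "yes" and the "censored" condition, get whichever label is listed first.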

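Answer 1: (score: 1)

I am somewhat confused by the example data. If I understand the rules correctly, no row in the example would be classed as censored, because either a1 or a2 is always non-zero, except in the last row, where everything is NA.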

library(dplyr)  # if_all() requires dplyr >= 1.0.4

mutate(dat, class = case_when(
  a1 != 0 | a2 != 0 ~ "yes",
  if_all(starts_with("a"), is.na) ~ "no",
  TRUE ~ "censored"
))
# A tibble: 5 x 5
     a1    a2 a3       a4       class
  <dbl> <dbl> <chr>    <chr>    <chr>
1    23    NA NA       NA       yes  
2    NA     6 NA       censored yes  
3     3     0 censored NA       yes  
4     0     9 censored NA       yes  
5    NA    NA NA       NA       no 
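
The claim that a1 or a2 is non-zero in every row except the last can be checked directly in base R. This is a small sketch using plain vectors with the same values as the example; the name cond is introduced here for illustration.

```r
a1 <- c(23, NA, 3, 0, NA)
a2 <- c(NA, 6, 0, 9, NA)

# With the | operator, TRUE | NA is TRUE, while NA | NA stays NA
cond <- a1 != 0 | a2 != 0
cond
# [1] TRUE TRUE TRUE TRUE   NA
```

Only the last element is NA, and case_when treats an NA condition as not matched, so that row falls through to the if_all() branch.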

Answer 2: (score: 0)

You can convert the tibble to a data frame like this:

dat = as.data.frame(dat)

Then you can create the new variable with conditions:

library(dplyr)
library(magrittr)
library(tidyselect)


dat2 = dat %>% select(starts_with("a")) %>% mutate(
  class = case_when(
    a1 != 0 ~ "yes",
    a2 != 0 ~ "yes"
    # rows matching neither condition get class = NA
  ))
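
As a rough sketch of what happens to rows that match neither branch, the snippet below runs the same two conditions on plain vectors. The label "other" in the second call is a hypothetical placeholder, not one of the labels asked for in the question.

```r
library(dplyr)

a1 <- c(23, NA, 3, 0, NA)
a2 <- c(NA, 6, 0, 9, NA)

# No default branch: unmatched rows become NA
without_default <- case_when(
  a1 != 0 ~ "yes",
  a2 != 0 ~ "yes"
)

# A catch-all TRUE branch labels the unmatched rows instead
with_default <- case_when(
  a1 != 0 ~ "yes",
  a2 != 0 ~ "yes",
  TRUE    ~ "other"  # hypothetical placeholder label
)
```

Here only the last row, where both a1 and a2 are NA, is left unmatched, so the remaining "no" and "censored" rules would still need their own branches.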