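How do I filter my data.table by condition and by group?

Asked: 2019-11-07 10:32:41

Tags: r filter dplyr data.table

Problem

I am working with a data.table in which each row is a medical observation. My data contain some errors that I need to correct before running any analysis. For example, a male patient may have an observation in which he is coded as female.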

Solution

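My solution is to take the mode (the most frequent value) of the variable for each patient. If a patient has 10 observations coded as male and only 1 coded as female, it is safe to assume that he is male.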

I found a clever way to do this with data.table.

DATA[j  = .N, 
     by = .(ID, SEX)][i = base::order(-N), 
     j = .(SEX = SEX[1L]), 
     keyby = ID]

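The problem is that when a patient has several modes, this keeps only one of them. A patient whose observations are 50% male and 50% female is therefore treated as male, which will bias the results. I would like to code such patients as NA instead.

The only fix I have found uses dplyr: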

DATA[j  = .N, 
     by = .(ID, SEX)] %>% 
     group_by(ID) %>% 
     filter(N == max(N))

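I then replace the SEX value with NA wherever an ID still appears more than once. However, this takes much longer than data.table alone, it is not well optimised, and my data set is large, with many other variables that need the same correction.

Summary

How can I take the per-patient mode of a variable and, if the mode is not unique, replace it with NA?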

Example

ID <- c(rep(x = "1", 6), rep(x = "2", 6))
SEX <- c("M","M","M","M","F","M","M","F","M","F","F","M")

require(data.table)
DATA <- data.table(ID, SEX)

# First method (doesn't work)
DATA[j  = .N, 
     by = .(ID, SEX)][i = base::order(-N), 
     j = .(SEX = SEX[1L]), 
     keyby = ID]

# Second method (works, but needs dplyr)
require(dplyr)
DATA[j  = .N, 
     by = .(ID, SEX)] %>% 
     group_by(ID) %>% 
     filter(N == max(N)) %>%
     # more than one row left for an ID means its mode is tied
     mutate(SEX = if_else(condition = duplicated(ID),
                          true = NA_character_,
                          false = SEX)) %>%
     filter(row_number() == n())

# Applied to my data it took 84.288 seconds
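
The tie rule described above can also be sketched without dplyr. What follows is a minimal data.table-only sketch, not a benchmarked solution: it counts each (ID, SEX) pair once and then, per patient, returns NA when the highest count is shared. The names `counts` and `modes` are mine, and the sketch assumes SEX contains no missing values (an NA would be counted as a category of its own).

```r
library(data.table)

DATA <- data.table(
  ID  = c(rep("1", 6), rep("2", 6)),
  SEX = c("M","M","M","M","F","M","M","F","M","F","F","M")
)

# One row per (ID, SEX) pair with its frequency N
counts <- DATA[, .N, by = .(ID, SEX)]

# Per patient: NA if the highest count is shared, otherwise the value holding it
modes <- counts[, .(SEX = if (sum(N == max(N)) > 1L) NA_character_
                          else SEX[which.max(N)]),
                keyby = ID]
```

On the example data, patient 1 has five M and one F, while patient 2 has three of each, so `modes` should hold M for ID 1 and NA for ID 2.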

Update

Solution proposed by @Cole, based on @Sindri_baldur's idea:

DATA <- data.table(
 ID = c(rep(x = "1", 6), rep(x = "2", 6)),
 SEX = c("M","M","M","M","F","M","M","F","M","F","F",NA),
 V1 = c("a", NA, "a", "a", "b", "a", "b", "b", "b", "c", "b", "c")
)

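our_mode_fac <- function(x) {
  # x is a factor; tabulate() counts its integer codes and ignores NA
  freq <- tabulate(x)
  if (length(freq) == 0 || sum(freq == max(freq)) > 1) {
    NA
  } else {
    levels(x)[which.max(freq)]
  }
}

vars <- c("SEX", "V1")

# Convert to factor, then overwrite each column with its per-ID mode.
# (vars) is parenthesised so that := assigns to the columns named in vars.
DATA[j = (vars) := lapply(.SD, as.factor),
     .SDcols = vars][j = (vars) := lapply(.SD, our_mode_fac),
                     .SDcols = vars,
                     by = ID]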

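This works well. Even when a patient has more NAs than actual values it still returns the mode, and when there is more than one mode it replaces the value with NA.

It is now also very fast: 11 seconds for 3M+ observations and 1M+ patients (versus 117 seconds with @Sindri_baldur's answer). Thank you both very much, I really appreciate it!

1 Answer:

Answer 0 (score: 2)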

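our_mode <- function(x) {
  # table() drops NA, so missing values never count towards the mode
  freq <- table(x)
  if (length(freq) == 0 || sum(freq == max(freq)) > 1 ) {
    # no non-missing values, or a tie for the top count
    NA_character_
  } else {
    names(freq)[which.max(freq)]
  }
}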

vars <- c("SEX", "V1")
DATA[, paste0(vars, "_corrected") := lapply(.SD, our_mode), .SDcols = vars, by = ID]

    ID  SEX   V1 SEX_corrected V1_corrected
 1:  1    M    a             M            a
 2:  1    M <NA>             M            a
 3:  1    M    a             M            a
 4:  1    M    a             M            a
 5:  1    F    b             M            a
 6:  1    M    a             M            a
 7:  2    M    b             F            b
 8:  2    F    b             F            b
 9:  2    M    b             F            b
10:  2    F    c             F            b
11:  2    F    b             F            b
12:  2 <NA>    c             F            b
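
To make the edge cases explicit, here is a small hedged sketch that calls the function on plain vectors. It uses a copy of the answer's function with the tie case written as `NA_character_` (my choice, so the return type is always character); the input vectors are made up for illustration.

```r
# Copy of the answer's function, repeated so this block runs on its own
our_mode <- function(x) {
  freq <- table(x)  # NA values are dropped by table()
  if (length(freq) == 0 || sum(freq == max(freq)) > 1) {
    NA_character_
  } else {
    names(freq)[which.max(freq)]
  }
}

clear_mode <- our_mode(c("M", "M", "F"))  # one value is most frequent
tied_mode  <- our_mode(c("M", "F"))       # two values share the top count
all_na     <- our_mode(c(NA, NA))         # nothing left to count
mostly_na  <- our_mode(c("M", NA, NA))    # NAs do not outvote a real value
```

Only `clear_mode` and `mostly_na` should come back as "M"; the tied and all-NA inputs should both give NA.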

Reproducible data

DATA <- data.table(
 ID = c(rep(x = "1", 6), rep(x = "2", 6)),
 SEX = c("M","M","M","M","F","M","M","F","M","F","F",NA),
 V1 = c("a", NA, "a", "a", "b", "a", "b", "b", "b", "c", "b", "c")
)

Note that our_mode() is not optimised for speed. See Cole's suggestion in the comments for a way to make it faster.
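
Cole's comment is not reproduced on this page, so the following is only a sketch of one common way to avoid the overhead of table(): count with tabulate() on integer positions. The name `fast_mode` is hypothetical, the function assumes a character vector, and no timing is claimed for it.

```r
fast_mode <- function(x) {
  x <- x[!is.na(x)]                       # missing values never count
  if (length(x) == 0L) return(NA_character_)
  ux   <- unique(x)                       # distinct values, in order of appearance
  freq <- tabulate(match(x, ux), nbins = length(ux))
  top  <- which(freq == max(freq))        # positions holding the highest count
  if (length(top) > 1L) NA_character_ else ux[top]
}

v1_patient1 <- fast_mode(c("a", NA, "a", "a", "b", "a"))
v1_patient2 <- fast_mode(c("b", "b", "b", "c", "b", "c"))
tied        <- fast_mode(c("M", "F"))
```

It could be dropped into the same `lapply(.SD, ...)` call in place of our_mode(), provided the columns are character rather than factor.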