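I am working with a data.table in which each row is a medical observation. The problem is that my data contains some errors that I need to correct before running any analysis. For example, a male patient may have an observation in which he is coded as female.
My solution is to take the mode (the most frequent value) of the variable for each patient. If a patient has 10 observations as male and 1 as female, it is safe to assume that he is male.
I found a clever way to do this with data.table.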
DATA[j = .N,
     by = .(ID, SEX)][i = base::order(-N),
                      j = .(SEX = SEX[1L]),
                      keyby = ID]
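The problem is that when a patient has several modes, this keeps only one of them. A patient who is 50% male and 50% female will therefore be counted as male, which will eventually introduce bias. I would like to code such patients as NA.
The only way I found to correct this is to use dplyr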
DATA[j = .N,
by = .(ID, SEX)] %>%
group_by(ID) %>%
filter(N == max(N))
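and then replace the SEX value with NA where an ID is duplicated. But this takes much longer than data.table alone, it is not well optimized, and my dataset is large, with many other variables that also need correcting.
How can I take the mode of a variable per patient and, when the mode is not unique, replace it with NA?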
ID <- c(rep(x = "1", 6), rep(x = "2", 6))
SEX <- c("M","M","M","M","F","M","M","F","M","F","F","M")
require(data.table)
DATA <- data.table(ID, SEX)
# First method (doesn't work)
DATA[j = .N,
by = .(ID, SEX)][i = base::order(-N),
j = .(SEX = SEX[1L]),
keyby = ID]
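# Second method (works, using dplyr)
require(dplyr)
DATA[j = .N,
     by = .(ID, SEX)] %>%
  group_by(ID) %>%
  filter(N == max(N)) %>%
  # A tied ID keeps several rows; mark all but the first as missing.
  mutate(SEX = if_else(condition = duplicated(ID),
                       true = NA_character_,  # a real missing value, not the string "NA"
                       false = SEX)) %>%
  # Keep the last row per ID, which is NA whenever there was a tie.
  filter(row_number() == n())
# Applied to my data it took 84.288 seconds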
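For comparison, the same tie-aware logic can be sketched without leaving data.table. This is only an illustration on the toy data above, not a benchmarked solution: it counts each (ID, SEX) pair, then returns NA for an ID whenever more than one value reaches the maximum count. Dropping missing SEX values before counting is an assumption about the desired behaviour.

```r
library(data.table)

DATA <- data.table(
  ID  = c(rep("1", 6), rep("2", 6)),
  SEX = c("M","M","M","M","F","M","M","F","M","F","F","M")
)

# Count each (ID, SEX) pair, ignoring missing SEX values (assumption).
counts <- DATA[!is.na(SEX), .N, by = .(ID, SEX)]

# Per ID: NA if several values share the top count, otherwise the top value.
result <- counts[, .(SEX = if (sum(N == max(N)) > 1L) NA_character_
                           else SEX[which.max(N)]),
                 keyby = ID]
print(result)
```

On this data, ID 1 has five M and one F, while ID 2 has three of each and so falls into the tie branch.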
Solution proposed by @Cole, based on @Sindri_baldur's idea:
DATA <- data.table(
ID = c(rep(x = "1", 6), rep(x = "2", 6)),
SEX = c("M","M","M","M","F","M","M","F","M","F","F",NA),
V1 = c("a", NA, "a", "a", "b", "a", "b", "b", "b", "c", "b", "c")
)
our_mode_fac <- function(x) {
freq <- tabulate(x)
if (length(freq) == 0 || sum(freq == max(freq)) > 1 ) {NA}
else {levels(x)[which.max(freq)]}
}
vars <- c("SEX", "V1")
# (vars) := assigns to the columns named in vars; a bare vars := would target a column called "vars".
DATA[j = (vars) := lapply(.SD, as.factor),
     .SDcols = vars][j = (vars) := lapply(.SD, our_mode_fac),
                     .SDcols = vars,
                     by = ID]
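It works well. Even when there are more NAs than actual values, it still takes the mode of the non-missing values, and when there is more than one mode it replaces the value with NA.
It is also very fast now: 11 seconds for 3M+ observations and 1M+ patients (versus 117 seconds with @Sindri_baldur's answer). Thank you both very much, I really appreciate it!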
Answer 0 (score: 2)
our_mode <- function(x) {
freq <- table(x)
if (length(freq) == 0 || sum(freq == max(freq)) > 1 ) {
NA
} else {
names(freq)[which.max(freq)]
}
}
vars <- c("SEX", "V1")
DATA[, paste0(vars, "_corrected") := lapply(.SD, our_mode), .SDcols = vars, by = ID]
    ID  SEX   V1 SEX_corrected V1_corrected
 1:  1    M    a             M            a
 2:  1    M <NA>             M            a
 3:  1    M    a             M            a
 4:  1    M    a             M            a
 5:  1    F    b             M            a
 6:  1    M    a             M            a
 7:  2    M    b             F            b
 8:  2    F    b             F            b
 9:  2    M    b             F            b
10:  2    F    c             F            b
11:  2    F    b             F            b
12:  2 <NA>    c             F            b
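The sample data in this answer contains no tied patient, because the last SEX value for ID 2 is NA. As a quick check of the tie case from the question, the function can be called directly on plain vectors. The vectors below are illustrative, and our_mode() is repeated only so that the snippet runs on its own.

```r
# Same function as in the answer.
our_mode <- function(x) {
  freq <- table(x)  # table() drops NA by default
  if (length(freq) == 0 || sum(freq == max(freq)) > 1) {
    NA
  } else {
    names(freq)[which.max(freq)]
  }
}

clear_case <- our_mode(c("M", "M", "M", "M", "F", "M"))  # one dominant value
tied_case  <- our_mode(c("M", "F", "M", "F", "F", "M"))  # three M, three F
all_na     <- our_mode(c(NA, NA))                        # nothing to count

print(list(clear_case, tied_case, all_na))
```

The first call returns "M", while the tied and all-missing inputs both return NA.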
Reproducible data
DATA <- data.table(
ID = c(rep(x = "1", 6), rep(x = "2", 6)),
SEX = c("M","M","M","M","F","M","M","F","M","F","F",NA),
V1 = c("a", NA, "a", "a", "b", "a", "b", "b", "b", "c", "b", "c")
)
Note that our_mode() is not optimized for speed. See Cole's suggestion in the comments for a way to make it faster.
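One common way to avoid the overhead of table() is to count with match() and tabulate() instead. The sketch below is an assumption about what a leaner version could look like, not necessarily what Cole suggested: the name fast_mode is hypothetical, and no timing is claimed for it.

```r
# Hypothetical helper: mode of x, or a typed NA when there is a tie or no data.
fast_mode <- function(x) {
  x <- x[!is.na(x)]                         # ignore missing values
  if (length(x) == 0L) return(x[NA_integer_])
  ux <- unique(x)                           # candidate values
  freq <- tabulate(match(x, ux), nbins = length(ux))
  top <- which(freq == max(freq))           # positions of the top count(s)
  if (length(top) > 1L) x[NA_integer_] else ux[top]
}

single <- fast_mode(c("M", "M", "F"))
tie    <- fast_mode(c("M", "F"))
sparse <- fast_mode(c("a", NA, NA, NA))
print(c(single, tie, sparse))
```

Because the NA it returns has the same type as the input, it could be used in the grouped call in place of our_mode, i.e. lapply(.SD, fast_mode) with by = ID.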