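I need to create new columns in a data.table based on conditions defined on some existing columns. However, I am running into problems with missing data. Specifically, each person is missing some data points, and for some people the entire questionnaire is missing (see p == 3 or 4 in the example data below). In that case (the whole questionnaire is missing), I want data.table to put NA in the output for that person. I tried to solve this with if_else from the dplyr package. However, data.table returns NaN or 0 instead of NA, even when all of a person's data are missing (for example, when column p is 3 or 4).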
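As a small illustration of where NaN and 0 can come from, the base R sketch below aggregates a vector that has no usable values. The variable names are made up for this example and do not appear in the scripts in this question.

```r
# Aggregating over zero values does not give NA in base R.
empty <- integer(0)                        # what is left when no row passes a filter

mean_empty <- mean(empty, na.rm = TRUE)    # NaN, because it is 0 / 0
sum_empty  <- sum(empty, na.rm = TRUE)     # 0, the empty sum

# The same happens when every value is NA and na.rm = TRUE drops them all.
all_na <- c(NA_integer_, NA_integer_, NA_integer_)
mean_all_na <- mean(all_na, na.rm = TRUE)  # NaN
sum_all_na  <- sum(all_na, na.rm = TRUE)   # 0

print(c(mean_empty, sum_empty, mean_all_na, sum_all_na))
```

So a result of NA for a person with no data has to be requested explicitly; mean() and sum() with na.rm = TRUE will not produce it on their own.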
Here is my current script, which only partly produces the desired output (correct output for p == 1 or 2, but not for p == 3 or 4):
library(data.table)
library(dplyr)
# Create example datatable
set.seed(4)
p <- c(rep(1, 5), rep(2, 5), rep(3, 5), rep(4, 5))
time1 <- as.integer(c(sample(1:20, 5, replace=TRUE), sample(21:40, 5, replace=TRUE), rep(NA, 10)))
closeness1 <- as.integer(c(NA, NA, sample(c(1:40,NA), 7, replace=TRUE), NA, rep(NA, 10)))
dt <- data.table::data.table(p, time1, closeness1)
# Compute new columns
dt[, c("mean1", "sum1") := .(
  dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
                 as.numeric(NA), .SD[time1 <= 10, mean(closeness1, na.rm=TRUE)]),
  dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
                 as.integer(NA), .SD[time1 <= 10, sum(closeness1, na.rm=TRUE)])),
  by = p, .SDcols = c("time1", "closeness1")]
The following script produces the output I would like to see. However, this is only for illustration, and I need to know how to modify the script above to produce the desired result:
# Select rows from original data that were as intended
p12 <- dplyr::filter(dt, p %in% c(1,2))
# Create new data.table with corrected output
p <- c(rep(3, 5), rep(4, 5))
time1 <- rep(NA_integer_, 10)
closeness1 <- rep(NA_integer_, 10)
mean1 <- rep(NA_real_, 10)
sum1 <- rep(NA_integer_, 10)
dt.des <- data.table::data.table(p, time1, closeness1, mean1, sum1)
# Desired output
dsrd.opt <- dplyr::bind_rows(p12, dt.des)
dsrd.opt
   p time1 closeness1 mean1 sum1
1  1    12         NA  21.5   43
2  1     1         NA  21.5   43
3  1     6         31  21.5   43
4  1     6         12  21.5   43
5  1    17          5  21.5   43
6  2    26         40   NaN    0
7  2    35         18   NaN    0
8  2    39         19   NaN    0
9  2    39         40   NaN    0
10 2    22         NA   NaN    0
11 3    NA         NA    NA   NA
12 3    NA         NA    NA   NA
13 3    NA         NA    NA   NA
14 3    NA         NA    NA   NA
15 3    NA         NA    NA   NA
16 4    NA         NA    NA   NA
17 4    NA         NA    NA   NA
18 4    NA         NA    NA   NA
19 4    NA         NA    NA   NA
20 4    NA         NA    NA   NA
It seems I oversimplified the example above. I basically need to compute the mean of closeness1 under two separate conditions, once for time1 <= 10 and once for time1 > 10 & time1 <= 21, and then save the respective outputs in two new columns. I have updated the example script accordingly; see below:
dt[, c("mean1", "mean2") := .(
  dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
                 as.numeric(NA), .SD[time1 <= 10, mean(closeness1, na.rm=TRUE)]),
  dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
                 as.numeric(NA), .SD[time1 > 10 & time1 <= 21, mean(closeness1, na.rm=TRUE)])),
  by = p, .SDcols = c("time1", "closeness1")]
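A note on the condition used in the scripts above: one likely reason the NA branch is never taken is that length() applied to a data.table counts columns, not rows, so comparing a count of NA cells against it probably does not express "every row is missing". The sketch below uses a hypothetical one-person table (grp) with the same column names to show this, together with a more direct per-column check. It is an illustration under those assumptions, not a full diagnosis of the scripts.

```r
library(data.table)

# Hypothetical table for a single person whose questionnaire is entirely missing.
grp <- data.table(time1      = c(NA_integer_, NA_integer_, NA_integer_),
                  closeness1 = c(NA_integer_, NA_integer_, NA_integer_))

# A data.table is a list of columns, so length() gives the column count.
n_cols <- length(grp)   # 2
n_rows <- nrow(grp)     # 3

# Testing each column with all(is.na()) states "everything is missing" directly.
all_missing <- grp[, all(is.na(time1)) | all(is.na(closeness1))]

# Rows whose comparison evaluates to NA are not selected by the filter,
# so nothing is left to aggregate for this person.
kept <- grp[time1 <= 10, closeness1]

print(list(n_cols = n_cols, n_rows = n_rows,
           all_missing = all_missing, kept = kept))
```

With no rows kept, mean() then returns NaN and sum() returns 0, which matches the symptom described in the question.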
Answer 0 (score: 0)
If I understand you correctly, I suggest using a simple left join. I think it is intuitive and achieves the desired result.
dt_result <- merge(x = dt
                   , y = dt[time1 <= 10, .(mean1 = mean(closeness1, na.rm = TRUE)
                                           , sum1 = sum(closeness1, na.rm = TRUE)), by = list(p)]
                   , by.x = "p"
                   , by.y = "p"
                   , all.x = TRUE
)
> dt_result
    p time1 closeness1 mean1 sum1
 1: 1    12         NA  21.5   43
 2: 1     1         NA  21.5   43
 3: 1     6         31  21.5   43
 4: 1     6         12  21.5   43
 5: 1    17          5  21.5   43
 6: 2    26         40    NA   NA
 7: 2    35         18    NA   NA
 8: 2    39         19    NA   NA
 9: 2    39         40    NA   NA
10: 2    22         NA    NA   NA
11: 3    NA         NA    NA   NA
12: 3    NA         NA    NA   NA
13: 3    NA         NA    NA   NA
14: 3    NA         NA    NA   NA
15: 3    NA         NA    NA   NA
16: 4    NA         NA    NA   NA
17: 4    NA         NA    NA   NA
18: 4    NA         NA    NA   NA
19: 4    NA         NA    NA   NA
20: 4    NA         NA    NA   NA
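Note that the join output above shows NA for p == 2, whereas the desired output in the question shows NaN and 0 for that person. If that distinction matters, one possible alternative is a single grouped assignment with an explicit "no data at all" check. The sketch below applies it to the two-condition case from the question. The table values and the helper mean_or_na are made up for illustration; they are not part of the original scripts.

```r
library(data.table)

# Small hypothetical table; the values are invented for this example.
dt <- data.table(
  p          = rep(1:3, each = 3),
  time1      = c(5L, 8L, 15L,   25L, 30L, 35L,   NA, NA, NA),
  closeness1 = c(10L, 20L, 30L, 40L, NA, 18L,    NA, NA, NA)
)

# Hypothetical helper: mean of x over the rows where keep is TRUE,
# or NA when the person has no data at all.
mean_or_na <- function(x, keep, no_data) {
  if (no_data) return(NA_real_)
  mean(x[which(keep)], na.rm = TRUE)   # which() ignores NA comparisons
}

dt[, c("mean1", "mean2") := {
  # TRUE only when a whole column is missing for this person
  no_data <- all(is.na(time1)) || all(is.na(closeness1))
  list(
    mean_or_na(closeness1, time1 <= 10, no_data),
    mean_or_na(closeness1, time1 > 10 & time1 <= 21, no_data)
  )
}, by = p]

print(dt)
```

In this sketch a person with data but no rows in a time window gets NaN, and a person with no data at all gets NA. A plain if is used instead of dplyr::if_else because the check yields a single TRUE or FALSE per group.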