Suppose I have the following data.table:
dta <- data.table(
criteria = c('A', 'A', 'B', 'A', 'A', 'B'),
phase = list('block3', c('block1', 'block2'), 'block2', 'block2', 'block3', 'block1'),
start_val = c(12.0, 1.0, 7.0, 7.0, 12.0, 1.0),
end_val = c(15.0, 11.0, 11.0, 11.0, 15.0, 6.0),
max_val = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0)
)
From it I need to derive a result table with two additional columns, cor_start and cor_end:
dtb <- data.table(
criteria = c('A', 'A', 'B', 'A', 'A', 'B'),
phase = list('block3', c('block1', 'block2'), 'block2', 'block2', 'block3', 'block1'),
start_val = c(12.0, 1.0, 7.0, 7.0, 12.0, 1.0),
end_val = c(15.0, 11.0, 11.0, 11.0, 15.0, 6.0),
max_val = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0),
cor_start = c(12.0, 1.0, 8.0, 9.5, 13.0, 6.0),
cor_end = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0)
)
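The new columns depend on the phase column: for each row, check whether an earlier row has a phase value that matches one of the current row's phase values.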
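To make this clearer, in this example: row 3 has previous matching row 2 (block2), row 4 has previous matching row 3 (block2), row 5 has previous matching row 1 (block3), and row 6 has previous matching row 2 (block1). Rows 1 and 2, however, have no previous matching phase row. Note that phase is a list column.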
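As a rough sketch of what "previous matching row" means when phase is a list column, the column can be unnested into long form and the row number shifted within each phase. The names rn, long and prev_rn are illustrative choices, not part of the original post, and only the phase column is reproduced here.

```r
library(data.table)

# Only the phase column from the example table is needed for this sketch.
dta <- data.table(
  phase = list('block3', c('block1', 'block2'), 'block2', 'block2', 'block3', 'block1')
)

# rn (illustrative name) records the original row number.
dta[, rn := .I]

# Unnest the list column: one row per (original row, phase) pair.
long <- dta[, .(phase = unlist(phase)), by = rn]

# Within each phase, look up the previous original row number.
# NA means that no earlier row shares this phase.
long[, prev_rn := shift(rn), by = phase]
print(long)
```

Original row 2 appears twice in long (once per phase), which is why the results have to be collapsed back to one row per rn afterwards.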
So, when a previous matching row exists, the conditions are:
if (max_val in previous matching row is < end_val in current row)
cor_start = previous matching row max_val
cor_end = current row end_val
if (max_val in previous matching row is > end_val in current row)
cor_start = current row end_val
cor_end = current row end_val
And when there is no previous matching row, the conditions are:
cor_start = current row start_val
cor_end = current row max_val
I looked into shift(), but could not work out how to apply the conditions above. Thanks!
Answer 0 (score: 0)
Something like this:
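library(data.table)
# Unnest the list column so each (row, phase) pair gets its own row; rn keeps the original row number
dta_transformed <- dta[, .(rn = .I, phase = unlist(phase)), by = setdiff(names(dta), 'phase')][
  # max_val of the previous row within the same phase (NA if there is none)
  , shifted_max := shift(max_val), by = phase][
  shifted_max < end_val, `:=` (cor_start = shifted_max, cor_end = end_val), by = phase][
  shifted_max > end_val, `:=` (cor_start = end_val, cor_end = end_val), by = phase][
  # rows still unset (e.g. no previous match) fall back to start_val / max_val
  is.na(cor_start), `:=` (cor_start = start_val, cor_end = max_val), by = phase][
  # collapse the phases back to one row per original row
  , phase := paste(phase, collapse = ","), by = rn][!duplicated(rn),][
  , c("rn", "shifted_max") := NULL]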
However, the output I get is:
criteria phase start_val end_val max_val cor_start cor_end
1: A block3 12 15 13.0 12.0 13
2: A block1,block2 1 11 8.0 1.0 8
3: B block2 7 11 9.5 8.0 11
4: A block2 7 11 11.0 9.5 11
5: A block3 12 15 15.0 13.0 15
6: B block1 1 6 6.0 6.0 6
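In your desired output, shouldn't cor_end in row 3 be 11? The previous matching row (2) has a max_val lower than the current end_val, so by your rules the current end_val (11) should be used.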
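To check that reading of the rules, here is a small sketch that applies the question's three conditions to a single row. cor_bounds is a hypothetical helper written for this illustration, not a data.table function, and it assumes that the unspecified case max_val == end_val is treated like the ">" branch.

```r
# Hypothetical helper (illustration only): applies the question's stated rules
# to one row, given the max_val of the previous matching row (NA if none).
cor_bounds <- function(prev_max, start_val, end_val, max_val) {
  if (is.na(prev_max)) {
    # no previous matching row
    c(cor_start = start_val, cor_end = max_val)
  } else if (prev_max < end_val) {
    c(cor_start = prev_max, cor_end = end_val)
  } else {
    # prev_max > end_val; equality is an assumption, the question does not cover it
    c(cor_start = end_val, cor_end = end_val)
  }
}

# Row 3: previous matching row is row 2, whose max_val is 8
row3 <- cor_bounds(prev_max = 8, start_val = 7, end_val = 11, max_val = 9.5)
# Row 6: previous matching row is also row 2
row6 <- cor_bounds(prev_max = 8, start_val = 1, end_val = 6, max_val = 6)
print(row3)
print(row6)
```

Under the rules as written, row 3 comes out with cor_end = 11, whereas the desired table lists 9.5.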
There is also a tidyverse approach, which is more readable:
library(tidyverse)
dta %>% mutate(rn = row_number()) %>%
unnest(phase) %>%
group_by(phase) %>%
mutate(
cor_start = case_when(
lag(max_val) < end_val ~ lag(max_val),
lag(max_val) > end_val ~ end_val,
TRUE ~ start_val
),
cor_end = if_else(!is.na(lag(max_val)), end_val, max_val)
) %>% group_by(rn) %>%
mutate(
phase = paste(phase, collapse = ",")
) %>% ungroup() %>% select(-rn) %>% distinct()
Answer 1 (score: 0)
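Here is another approach that uses pmin() instead of ifelse() and takes advantage of the fill parameter of the shift() function. It also reduces the number of grouped operations: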
library(data.table)
dta[, rn := .I]
dta[dta[, .(phase2 = unlist(phase)), by = rn], on = "rn"][
, `:=`(cor_start = pmin(shift(max_val, fill = start_val[1]), end_val),
cor_end = max_val), by = phase2][
, .SD[1], by = rn][
, c("rn", "phase2") := NULL][]
   criteria         phase start_val end_val max_val cor_start cor_end
1:        A        block3        12      15    13.0      12.0    13.0
2:        A block1,block2         1      11     8.0       1.0     8.0
3:        B        block2         7      11     9.5       8.0     9.5
4:        A        block2         7      11    11.0       9.5    11.0
5:        A        block3        12      15    15.0      13.0    15.0
6:        B        block1         1       6     6.0       6.0     6.0
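To see why pmin() together with shift(fill = ...) can stand in for the branching, here is a sketch on plain vectors holding the block1 group from the example (original rows 2 and 6). It assumes start_val never exceeds end_val in the first row of a group, so pmin() leaves the fill value untouched. Note also that this answer sets cor_end to max_val throughout, which matches the desired table rather than the rules as written in the question.

```r
library(data.table)

# block1 group from the example: original rows 2 and 6
start_val <- c(1, 1)
end_val   <- c(11, 6)
max_val   <- c(8, 6)

# Previous max_val within the group; the first row has no predecessor,
# so fill supplies the group's first start_val instead of NA.
prev_max <- shift(max_val, fill = start_val[1])

# Cap the previous max_val at the current end_val.
cor_start <- pmin(prev_max, end_val)
print(prev_max)
print(cor_start)
```

For row 6 the previous max_val (8) exceeds end_val (6), so pmin() returns 6, which is what the "greater than" branch of the question asks for.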