I want to create a function in which I can specify which column should serve as the anchor, i.e. the basis for the calculations.
set.seed(123)
library(data.table)
dt = data.table(Acc_ID = c(1:50),
P1 = sample((0:10000), 50, replace = T),
P2 = sample((0:10000), 50, replace = T),
P3 = sample((0:10000), 50, replace = T),
P4 = sample((0:10000), 50, replace = T),
P5 = sample((0:10000), 50, replace = T),
P6 = sample((0:10000), 50, replace = T),
P7 = sample((0:10000), 50, replace = T),
P8 = sample((0:10000), 50, replace = T),
P9 = sample((0:10000), 50, replace = T),
P10 = sample((0:10000), 50, replace = T),
P11 = sample((0:10000), 50, replace = T),
P12 = sample((0:10000), 50, replace = T))
The final result should look like this:
dt[, `:=` (sumcoll1m = `P12`,
sumcoll3m = rowSums(dt[, `P10`:`P12`]),
sumcoll6m = rowSums(dt[, `P7`:`P12`]),
sumcoll12m = rowSums(dt[, `P1`:`P12`]),
payments1m = ifelse(dt[, `P12`] > 0, 1, 0),
payments3m = rowSums(dt[, `P10`:`P12`] > 0),
payments6m = rowSums(dt[, `P7`:`P12`] > 0),
payments12m = rowSums(dt[, `P1`:`P12`] > 0))]
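In this example the anchor is P12, but it could be any column, and the columns could also have other names. What I want is the same window lengths regardless of the anchor. The exception is that if the anchor is, say, P1, only the calculations that are applicable should be carried out.
Is there a clever way to do this?
Thanks in advance!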
Edit: Yes, the columns represent months. The expected result for P5 would be:
dt[, `:=` (sumcoll1m = `P5`,
sumcoll3m = rowSums(dt[, `P3`:`P5`]),
payments1m = ifelse(dt[, `P5`] > 0, 1, 0),
payments3m = rowSums(dt[, `P3`:`P5`] > 0))]
This is where I am right now:
dt[, `:=` (sumcoll1m = `P12`,
sumcoll3m = rowSums(dt[, c(which(names(dt) == "P12") - seq(0, 2)), with = F]),
sumcoll6m = rowSums(dt[, c(which(names(dt) == "P12") - seq(0, 5)), with = F]),
sumcoll12m = rowSums(dt[, c(which(names(dt) == "P12") - seq(0, 11)), with = F]),
payments1m = ifelse(dt[, `P12`] > 0, 1, 0),
payments3m = rowSums(dt[, c(which(names(dt) == "P12") - seq(0, 2)), with = F] > 0),
payments6m = rowSums(dt[, c(which(names(dt) == "P12") - seq(0, 5)), with = F] > 0),
payments12m = rowSums(dt[, c(which(names(dt) == "P12") - seq(0, 11)), with = F] > 0))]
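Answer 0 (score: 0)
This is a tricky problem. My suggestion is to reshape the data from wide to long format and to use tail() to compute the aggregates over windows of varying length.
But first, we need to define a minimal working dataset which helps to verify that the results are correct: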
library(data.table)
n_row <- 2
DT <- data.table(Acc_ID = seq_len(n_row))
for (i in 1:12) {
set(DT, , paste0("P", i), (100*seq_len(n_row) + i) * (-1)^i)
}
DT
   Acc_ID   P1  P2   P3  P4   P5  P6   P7  P8   P9 P10  P11 P12
1:      1 -101 102 -103 104 -105 106 -107 108 -109 110 -111 112
2:      2 -201 202 -203 204 -205 206 -207 208 -209 210 -211 212
long <- melt(DT, "Acc_ID")
long[, variable := as.ordered(variable)]
long
    Acc_ID variable value
 1:      1       P1  -101
 2:      2       P1  -201
 3:      1       P2   102
 4:      2       P2   202
 5:      1       P3  -103
 6:      2       P3  -203
 7:      1       P4   104
 8:      2       P4   204
 9:      1       P5  -105
10:      2       P5  -205
11:      1       P6   106
12:      2       P6   206
13:      1       P7  -107
14:      2       P7  -207
15:      1       P8   108
16:      2       P8   208
17:      1       P9  -109
18:      2       P9  -209
19:      1      P10   110
20:      2      P10   210
21:      1      P11  -111
22:      2      P11  -211
23:      1      P12   112
24:      2      P12   212
    Acc_ID variable value
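After melt(), variable is already a factor whose levels follow the left-to-right order of the columns. However, so that it can be compared against the anchor, variable is turned into an ordered factor. This way, the columns can be named arbitrarily; only the order of the columns matters.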
str(long)
Classes ‘data.table’ and 'data.frame':	24 obs. of  3 variables:
 $ Acc_ID  : int  1 2 1 2 1 2 1 2 1 2 ...
 $ variable: Ord.factor w/ 12 levels "P1"<"P2"<"P3"<..: 1 1 2 2 3 3 4 4 5 5 ...
 $ value   : num  -101 -201 102 202 -103 -203 104 204 -105 -205 ...
 - attr(*, ".internal.selfref")=<externalptr>
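To illustrate why the ordered factor matters, here is a minimal base R sketch. The column names below are hypothetical and are not the dataset above. The levels are taken from the order of appearance, so a comparison with the anchor follows the column position rather than the alphabetical order of the names.

```r
# Hypothetical column names, in the order they would appear in the table
cols <- c("P1", "P2", "P10", "P11")

# Ordered factor whose levels follow the order of appearance
f <- factor(cols, levels = cols, ordered = TRUE)

# Comparing an ordered factor with a character value uses the level order
keep <- f <= "P10"
as.character(f[keep])
# "P1" "P2" "P10"
```

A plain character comparison would depend on the collation order of the strings instead, which is why the answer converts variable to an ordered factor first.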
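The OP has requested to compute aggregates for windows of different sizes, all of which end at the anchor:
Length 1, i.e., only the anchor column.
Length 3, which includes the two columns to the left of the anchor and the anchor column. For the anchors P1 and P2 this window is skipped, because there are too few columns to complete a set of three.
Length 6, which includes the five columns to the left of the anchor and the anchor column. This can only be computed for the anchors P6, P7, etc., where a full set of six columns is available.
Length 12, which includes the eleven columns to the left of the anchor and the anchor column. This can only be computed for P12.
Although the OP did not mention it explicitly, the use of rowSums() implies that the aggregates have to be computed separately for each row. Here, we assume that Acc_ID identifies each row uniquely.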
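The rule for skipping windows can be sketched in a few lines of base R. The helper name feasible_windows() is hypothetical and is not used in the answer's code; it only shows which of the window sizes 1, 3, 6 and 12 fit to the left of a given anchor.

```r
# Data columns in table order (as in the sample dataset)
cols <- paste0("P", 1:12)
windows <- c(1, 3, 6, 12)

# Hypothetical helper: keep only the window sizes that fit up to the anchor
feasible_windows <- function(anchor, cols, windows) {
  pos <- match(anchor, cols)          # position of the anchor column
  stopifnot(!is.na(pos))              # the anchor must be a known column
  windows[windows <= pos]             # a window of size w needs w columns
}

feasible_windows("P2", cols, windows)   # 1
feasible_windows("P5", cols, windows)   # 1 3
feasible_windows("P12", cols, windows)  # 1 3 6 12
```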
library(magrittr)
anchor <- "P5"
lapply(c(1, 3, 6, 12),
function(x) {
long[variable <= anchor,
if (x <= .N)
.(sum(tail(value, x)), sum(tail(value, x) > 0)) %>%
setNames(sprintf(c("sumcoll%im", "payments%im"), x)),
by = Acc_ID]
}
) %>%
Reduce(function(x, y) merge(x, y, by = "Acc_ID", all.x = TRUE), .)
   Acc_ID sumcoll1m payments1m sumcoll3m payments3m
1:      1      -105          0      -104          1
2:      2      -205          0      -204          1
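Note that the term "column" is still used here to refer to the data as laid out in wide format, although the data have been reshaped to long format.
For each window size x, only the columns up to and including the anchor are considered (variable <= anchor), the window is skipped if fewer than x columns are available (if (x <= .N)), and the aggregates are computed over the last x columns only (tail(value, x)), grouped by Acc_ID, i.e., row by row.
Before the separate pieces are merged together, the output of the lapply() call looks like this: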
[[1]]
   Acc_ID sumcoll1m payments1m
1:      1      -105          0
2:      2      -205          0
[[2]]
   Acc_ID sumcoll3m payments3m
1:      1      -104          1
2:      2      -204          1
[[3]]
Empty data.table (0 rows) of 1 col: Acc_ID
[[4]]
Empty data.table (0 rows) of 1 col: Acc_ID
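As a quick check of a single cell, the 3-month aggregates for Acc_ID 1 with anchor P5 can be reproduced by hand in base R. This is only an illustrative sketch; the values are copied from the sample dataset above.

```r
# Values of P1..P5 for Acc_ID 1, i.e. everything up to the anchor P5
value <- c(-101, 102, -103, 104, -105)
x <- 3                       # window size

window <- tail(value, x)     # last three columns: P3, P4, P5
sum(window)                  # -104, matches sumcoll3m
sum(window > 0)              # 1, matches payments3m
```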
For convenience, this can be wrapped in a function:
anchored_aggregate <- function(DT, anchor) {
library(data.table)
library(magrittr)
long <- melt(DT, "Acc_ID")
long[, variable := as.ordered(variable)]
lapply(c(1, 3, 6, 12),
function(x) {
long[variable <= anchor,
if (x <= .N)
.(sum(tail(value, x)), sum(tail(value, x) > 0)) %>%
setNames(sprintf(c("sumcoll%im", "payments%im"), x)),
by = Acc_ID]
}
) %>%
Reduce(function(x, y) merge(x, y, by = "Acc_ID", all.x = TRUE), .)
}
anchored_aggregate(DT, "P2")
   Acc_ID sumcoll1m payments1m
1:      1       102          1
2:      2       202          1
anchored_aggregate(DT, "P3")
   Acc_ID sumcoll1m payments1m sumcoll3m payments3m
1:      1      -103          0      -102          1
2:      2      -203          0      -202          1
anchored_aggregate(DT, "P7")
   Acc_ID sumcoll1m payments1m sumcoll3m payments3m sumcoll6m payments6m
1:      1      -107          0      -106          1        -3          3
2:      2      -207          0      -206          1        -3          3
anchored_aggregate(DT, "P12")
   Acc_ID sumcoll1m payments1m sumcoll3m payments3m sumcoll6m payments6m sumcoll12m payments12m
1:      1       112          1       111          2         3          3          6           6
2:      2       212          1       211          2         3          3          6           6
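The OP has asked how the aggregated results can be appended to the original dataset.
This can be done with another join operation, e.g., using the function created above: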
DT[anchored_aggregate(DT, "P5"), on = "Acc_ID"]
   Acc_ID   P1  P2   P3  P4   P5  P6   P7  P8   P9 P10  P11 P12 sumcoll1m payments1m sumcoll3m payments3m
1:      1 -101 102 -103 104 -105 106 -107 108 -109 110 -111 112      -105          0      -104          1
2:      2 -201 202 -203 204 -205 206 -207 208 -209 210 -211 212      -205          0      -204          1
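Answer 1 (score: 0)
Here is a different approach which works on the data in wide format but uses the same trick with an ordered factor and tail() as the other answer. The .SDcols parameter is used to select the required columns.
Unlike that answer, there is no need to reshape the data from wide to long format. In addition, this approach updates DT by reference right away, so no final join is required.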
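As a small illustration of the .SDcols idea on its own, the following sketch adds a row sum and a count of positive values for a chosen subset of columns by reference. The table toy and its column names are made up for this example.

```r
library(data.table)

# Made-up toy table; id identifies the rows
toy <- data.table(id = 1:2, a = c(1, -1), b = c(2, 2), d = c(-3, 3))

# Columns to aggregate; .SD contains only these columns
cols <- c("b", "d")

# Add both aggregates by reference, computed row by row
toy[, c("s", "n_pos") := .(rowSums(.SD), rowSums(.SD > 0)), .SDcols = cols]
toy[]
```

Because := modifies toy in place, the new columns s and n_pos are available immediately and no join back to the original table is needed.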
library(data.table)
# prepare sample data set
n_row <- 2
DT <- data.table(Acc_ID = seq_len(n_row))
for (i in 1:12) {
set(DT, , paste0("P", i), (100*seq_len(n_row) + i) * (-1)^i)
}
# preserve unmodified copy of original dataset
DT0 <- copy(DT)
# create vector of data column names as ordered factor in order of appearance
library(magrittr)
nam_DT <-
# omit id column
colnames(DT)[-1] %>%
forcats::fct_inorder(ordered = TRUE)
anchor <- "P5"
# start with fresh copy of original dataset
DT <- copy(DT0)
# loop over window sizes
lapply(c(1, 3, 6, 12),
function(x) {
# create character vector of columns to process
cols <- nam_DT[nam_DT <= anchor] %>%
tail(x) %>%
as.character()
# skip if too few columns available
if (length(cols) == x) {
# compute aggregates and update by reference
DT[, sprintf(c("sumcoll%im", "payments%im"), x) :=
.(rowSums(.SD), rowSums(.SD > 0)), .SDcols = cols]
}
# suppress intermediate results
}) %>% invisible()
# print updated dataset
DT[]
   Acc_ID   P1  P2   P3  P4   P5  P6   P7  P8   P9 P10  P11 P12 sumcoll1m payments1m sumcoll3m payments3m
1:      1 -101 102 -103 104 -105 106 -107 108 -109 110 -111 112      -105          0      -104          1
2:      2 -201 202 -203 204 -205 206 -207 208 -209 210 -211 212      -205          0      -204          1
For comparison, the result of the join with the long-format approach:
# use the unmodified copy DT0, because DT has already been updated by reference
DT0[anchored_aggregate(DT0, "P5"), on = "Acc_ID"]
   Acc_ID   P1  P2   P3  P4   P5  P6   P7  P8   P9 P10  P11 P12 sumcoll1m payments1m sumcoll3m payments3m
1:      1 -101 102 -103 104 -105 106 -107 108 -109 110 -111 112      -105          0      -104          1
2:      2 -201 202 -203 204 -205 206 -207 208 -209 210 -211 212      -205          0      -204          1
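Since the question asks for a function, the wide-format approach could also be wrapped up as sketched below. This is not part of either answer. The function name add_anchored_aggregates() and its arguments id_col and windows are assumptions made for illustration, and it locates the anchor by column position with match() instead of an ordered factor. It returns a modified copy rather than updating the input by reference.

```r
library(data.table)

# Hypothetical helper: returns a copy of DT with the anchored aggregates
# appended for every window size that fits to the left of the anchor.
add_anchored_aggregates <- function(DT, anchor, id_col = "Acc_ID",
                                    windows = c(1, 3, 6, 12)) {
  data_cols <- setdiff(colnames(DT), id_col)  # data columns in table order
  pos <- match(anchor, data_cols)             # position of the anchor
  if (is.na(pos)) stop("anchor is not a data column: ", anchor)
  out <- copy(DT)                             # leave the input untouched
  for (w in windows[windows <= pos]) {        # skip windows that do not fit
    cols <- data_cols[(pos - w + 1):pos]      # last w columns up to the anchor
    out[, sprintf(c("sumcoll%im", "payments%im"), w) :=
          .(rowSums(.SD), rowSums(.SD > 0)), .SDcols = cols]
  }
  out[]
}

# Same sample data as in the answers
n_row <- 2
DT <- data.table(Acc_ID = seq_len(n_row))
for (i in 1:12) {
  set(DT, , paste0("P", i), (100 * seq_len(n_row) + i) * (-1)^i)
}

res <- add_anchored_aggregates(DT, "P5")
res
```

With the anchor P5, only the 1-month and 3-month columns are added, which should agree with the tables shown above.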