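I am trying to estimate some parameters over n factor levels in a data.table. I am familiar with using the by argument to perform an operation within a single factor level, but doing the same across several consecutive levels is causing me problems.
For example, using this simplified data set: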
df <- data.table(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)), Variable = round(rnorm(16), 2))
Group Variable
1: A 0.13
2: A 0.26
3: B -1.36
4: B -0.78
5: B -0.92
6: C 0.00
7: C -2.49
8: D -1.85
9: D 0.37
10: D -0.57
11: D 1.42
12: E -0.72
13: F -1.04
14: F 1.86
15: F 0.49
16: F 1.61
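Using df[, mean(Variable), by = Group] gives the mean of each group. However, I want the mean over this group and the previous groups, n groups in total.
I tried df[, zoo::rollapply(Variable, n, mean), by = Group], but that does not work: the window has a fixed width of n rows, while the groups differ in size.
The functionality I am after is something like df[, mean(Variable), by = "This Group and previous n Groups"].
The output I would like to achieve (for n = 3) looks like: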
Group Variable
1: A NA
2: A NA
3: B NA
4: B NA
5: B NA
6: C 0.13
7: C 0.13
8: D -1.36
9: D -1.36
10: D -1.36
11: D -1.36
12: E 0
13: F -1.85
14: F -1.85
15: F -1.85
16: F -1.85
Any help would be greatly appreciated.
Answer 0 (score: 4)
library(data.table)
library(RcppRoll)
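# Collapse to one row per group: row count and sum of Variable
df1 <- df[, .(n=.N, S=sum(Variable)), by = Group]
# Pooled mean over the current group and the two groups before it
df1[, NewVariable := roll_sum(S, 3, align="right", fill=NA)/roll_sum(n, 3, align="right", fill=NA)]
# Join the per-group result back onto the original rows
df[df1, on="Group"]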
Group Variable n S NewVariable
1: A -0.63 2 -0.45 NA
2: A 0.18 2 -0.45 NA
3: B -0.84 3 1.09 NA
4: B 1.60 3 1.09 NA
5: B 0.33 3 1.09 NA
6: C -0.82 2 -0.33 0.04428571
7: C 0.49 2 -0.33 0.04428571
8: D 0.74 4 2.52 0.36444444
9: D 0.58 4 2.52 0.36444444
10: D -0.31 4 2.52 0.36444444
11: D 1.51 4 2.52 0.36444444
12: E 0.39 1 0.39 0.36857143
13: F -0.62 4 -1.75 0.12888889
14: F -2.21 4 -1.75 0.12888889
15: F 1.12 4 -1.75 0.12888889
16: F -0.04 4 -1.75 0.12888889
I hope my solution is self-explanatory.
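In case it is not, here is a minimal sketch of why the rolling sums are divided rather than averaging the group means directly. The toy table and its values are made up for illustration and are not the data from the question. Dividing the summed totals by the summed counts gives the mean over all rows in the window, whereas a mean of group means would weight every group equally regardless of its size.

```r
library(data.table)

# Toy data (hypothetical): three groups of unequal size
toy <- data.table(Group    = c("A", "A", "B", "B", "B", "C"),
                  Variable = c(1, 3, 2, 4, 6, 5))

# Per-group row counts and sums, built the same way as df1 above
agg <- toy[, .(n = .N, S = sum(Variable)), by = Group]

# Pooled mean over all three groups, computed from the aggregates only
pooled <- sum(agg$S) / sum(agg$n)

# Mean of the per-group means: each group counts once, whatever its size
mean_of_means <- mean(agg$S / agg$n)

pooled          # same value as mean(toy$Variable)
mean_of_means   # differs, because the groups have different sizes
```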
The dplyr equivalent is:
library(dplyr)
df %>%
group_by(Group) %>%
summarise(n=n(), S=sum(Variable)) %>%
mutate(NewVar=roll_sum(S, 3, align="right", fill=NA)/roll_sum(n, 3, align="right", fill=NA)) %>%
left_join(df, by="Group")
Data
set.seed(1)
df <- data.table(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)), Variable = round(rnorm(16), 2))
Package info
[1] RcppRoll_0.2.2 data.table_1.9.5
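As a possible alternative without the RcppRoll dependency, the same idea can be sketched with data.table's own frollsum(). This assumes a newer data.table than the version listed above (frollsum() was added in 1.12.0); its defaults are already align = "right" and fill = NA.

```r
library(data.table)  # assumption: data.table >= 1.12.0, which provides frollsum()

set.seed(1)
df <- data.table(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)),
                 Variable = round(rnorm(16), 2))

# One row per group: row count and sum of Variable
df1 <- df[, .(n = .N, S = sum(Variable)), by = Group]

# Rolling sums over 3 consecutive groups; the first two groups get NA
df1[, NewVariable := frollsum(S, 3) / frollsum(n, 3)]

# Join the per-group result back onto the original rows
res <- df[df1, on = "Group"]
res
```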
Answer 1 (score: 2)
This may not be the most efficient approach, but it works:
First, set a seed for reproducibility:
set.seed(1038)
> df
Group Variable
1: A -0.86
2: A 0.57
3: B 0.10
4: B -1.57
5: B 1.73
6: C -0.56
7: C 0.54
8: D -1.71
9: D -0.47
10: D -1.00
11: D 1.03
12: E -0.47
13: F -1.06
14: F -2.06
15: F -0.57
16: F 1.70
Now coerce Group to an integer, so that "n - 1 groups back" is well defined, and then collapse the multiple observations of each group by grp_no:
setkey(df[ , grp_no := as.integer(as.factor(Group))], grp_no)
df_ttls <- df[ , .(ttl = sum(Variable), .N), by = grp_no]
> df_ttls
grp_no ttl N
1: 1 -0.29 2
2: 2 0.26 3
3: 3 -0.02 2
4: 4 -2.15 4
5: 5 -0.47 1
6: 6 -1.99 4
Now use shift to build the weighted average you are after:
df_ttls[ , lag3avg := rowSums(sapply(0:2, shift, x = ttl))/
rowSums(sapply(0:2, shift, x = N))]
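
To see what the sapply(0:2, shift, ...) call is doing, here is a small sketch on a made-up vector of totals (the values are hypothetical). Each lag becomes one column of a matrix, and rowSums() then adds the current value and its two predecessors; any row that reaches back past the start contains NA and so sums to NA.

```r
library(data.table)

ttl <- c(1, 2, 3, 4)   # hypothetical per-group totals

# One column per lag: lag 0 (the value itself), lag 1, lag 2
lags <- sapply(0:2, shift, x = ttl)
lags

# Row-wise sum = current total plus the two previous ones
window_sum <- rowSums(lags)
window_sum   # NA NA 6 9
```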
And merge back into the full data set:
df[df_ttls, lag3avg := i.lag3avg][ ]
Group Variable grp_no lag3avg
1: A -0.86 1 NA
2: A 0.57 1 NA
3: B 0.10 2 NA
4: B -1.57 2 NA
5: B 1.73 2 NA
6: C -0.56 3 -0.007142857
7: C 0.54 3 -0.007142857
8: D -1.71 4 -0.212222222
9: D -0.47 4 -0.212222222
10: D -1.00 4 -0.212222222
11: D 1.03 4 -0.212222222
12: E -0.47 5 -0.377142857
13: F -1.06 6 -0.512222222
14: F -2.06 6 -0.512222222
15: F -0.57 6 -0.512222222
16: F 1.70 6 -0.512222222
Note that this extends easily to a function:
k_lag_avg <- function(k){
df[df_ttls[ , .(grp_no, rowSums(sapply(1:k - 1L, shift, x = ttl))/
rowSums(sapply(1:k -1L, shift, x = N)))],
paste0("lag", k, "avg") := i.V2]
}
k_lag_avg(5L); df[ ]
Group Variable grp_no lag3avg lag5avg
1: A -0.86 1 NA NA
2: A 0.57 1 NA NA
3: B 0.10 2 NA NA
4: B -1.57 2 NA NA
5: B 1.73 2 NA NA
6: C -0.56 3 -0.007142857 NA
7: C 0.54 3 -0.007142857 NA
8: D -1.71 4 -0.212222222 NA
9: D -0.47 4 -0.212222222 NA
10: D -1.00 4 -0.212222222 NA
11: D 1.03 4 -0.212222222 NA
12: E -0.47 5 -0.377142857 -0.2225000
13: F -1.06 6 -0.512222222 -0.3121429
14: F -2.06 6 -0.512222222 -0.3121429
15: F -0.57 6 -0.512222222 -0.3121429
16: F 1.70 6 -0.512222222 -0.3121429
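The function above reads df and df_ttls from the global environment and modifies df in place. A variant that takes the table as an argument could look roughly like the sketch below. The name add_lag_avg and the toy data are hypothetical, the column names Group and Variable are hard-coded, and it assumes the groups already appear in the desired order (it does not sort them the way as.factor does).

```r
library(data.table)

# Hypothetical helper: returns a copy of `dt` with a pooled mean over the
# current group and the k - 1 groups before it (in order of first appearance)
add_lag_avg <- function(dt, k) {
  out <- copy(dt)                       # leave the caller's table untouched
  ttls <- out[, .(ttl = sum(Variable), N = .N), by = Group]
  # Sum of a vector with its k - 1 lags; NA where the window is incomplete
  lagged_sum <- function(x) Reduce(`+`, lapply(0:(k - 1L), function(l) shift(x, l)))
  ttls[, avg := lagged_sum(ttl) / lagged_sum(N)]
  # Update join: copy each group's average onto all of its rows
  out[ttls, (paste0("lag", k, "avg")) := i.avg, on = "Group"]
  out[]
}

# Usage on made-up data
toy <- data.table(Group    = c("A", "A", "B", "B", "B", "C"),
                  Variable = c(1, 3, 2, 4, 6, 5))
res <- add_lag_avg(toy, 2L)
res
```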
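Answer 2 (score: 1)
If you are willing to convert your data.table to a data.frame for this step, I can help. Look at this example and run the commands one at a time to see how it works. It covers the n = 3 case you mentioned.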
library(dplyr)
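# stringsAsFactors = TRUE so that as.numeric(Group) below yields group indices (not the default in R >= 4.0)
df <- data.frame(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)),
Variable = round(rnorm(16), 2), stringsAsFactors = TRUE)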
df %>% group_by(Group) %>%
do(data.frame(df2 = df)) %>%
mutate(diff = as.numeric(Group) - as.numeric(df2.Group)) %>%
filter(diff %in% 0:2) %>%
mutate(unique_pairs = n_distinct(diff)) %>%
filter(unique_pairs ==3) %>%
mutate(Mean = mean(df2.Variable)) %>%
filter(diff==0) %>%
select(Group, Mean) %>%
ungroup
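The idea is simply to create all combinations of "Group" names and then add a few helper columns to filter on. You could do this with a for loop, but I expect that would be slower.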
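For comparison, a loop version of the same idea might look like the following base R sketch for n = 3. It is only an illustration: the seed is added here for reproducibility, and groups are taken in order of first appearance rather than through factor codes.

```r
set.seed(1)  # added for reproducibility; not part of the example above
df <- data.frame(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)),
                 Variable = round(rnorm(16), 2))

n <- 3
groups <- unique(df$Group)              # groups in order of first appearance
Mean <- rep(NA_real_, length(groups))   # stays NA while fewer than n groups are available

for (i in seq_along(groups)) {
  if (i >= n) {
    window <- groups[(i - n + 1):i]     # this group and the n - 1 before it
    Mean[i] <- mean(df$Variable[df$Group %in% window])
  }
}

result <- data.frame(Group = groups, Mean = Mean)
result
```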
If you really want to use data.table (still dplyr, but with a data.table structure in the background), try this:
library(dplyr)
library(data.table)
df <- data.table(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)),
Variable = round(rnorm(16), 2))
df = df %>% mutate(Group2 = as.numeric(as.factor(Group)))
df %>%
group_by(Group2, Group) %>%
do(data.table(df2 = df)) %>%
mutate(diff = Group2 - df2.Group2) %>%
filter(diff %in% 0:2) %>%
group_by(Group2, Group) %>%
mutate(unique_pairs = n_distinct(diff)) %>%
filter(unique_pairs ==3) %>%
group_by(Group2, Group) %>%
mutate(Mean = mean(df2.Variable)) %>%
filter(diff==0) %>%
select(Group2, Group, Mean) %>%
ungroup
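Here data.table does not like factors, so I had to use numbers instead of the letters of the Group variable. I also had to group again after each mutate (a known dplyr issue when you use data.table in the background). The idea is exactly the same, though.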