Divide by factor and carry back

Time: 2017-05-25 02:04:01

Tags: r data.table

I have a data.table that looks like this:

library(data.table)

A <- c(1,3,5,20,21,21)
B <- c(1, 2, 3, 4, 5, 6)
C <- c("I","I","II","II","III","III")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7)
M <- data.table(A,B,C,D)

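My question is similar to "R help: divide values by sum produced through factor", but with some additional considerations. A specifies the date (I am just using integers here). B is the individual. C is the classification the individual belongs to. D is a value variable.

For each classification c in C and each day a in A, divide the value D by the sum of the values of all individuals in c, carrying values back when needed: an observation on another date x may be used if 0 < x - a <= N (meaning we pick the smallest x - a and use that observation as an approximation of the other individual's value on day a).

Let's say N = 5. Here is my expected output.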

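A <- c(1,3,5,20,21,21)
B <- c(1, 2, 3, 4, 5, 6)
C <- c("I","I","II","II","III","III")
D <- c(0.7/(0.7+0.3), 0.3/(0.3), 0.5/(0.5), 0.9/(0.9), 4/(4+7), 7/(4+7))
M <- data.table(A,B,C,D)

Note that for individual 3 the value from its group (II) is not carried back, because the gap is greater than 5 (20 - 5 = 15). Is there a nice way to do this in data.table?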

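For each value in D, I want to divide by the sum of all values in the same group (I, II, III) on that day. However, you will notice that for some groups no observation exists on that day. I will try to walk through the logic with a few observations.

Edit: Let me try walking through a few cases.

For individual 1 (column B) on day 1 (column A), the individual belongs to group I (column C). The other individuals in group I are: 2. For each of the others, we see that for individual 2 the closest observation is on day 3 and 3 - 1 <= 5, so we use 0.3 in the denominator.

For individual 3 (column B) on day 5 (column A), the individual belongs to group II (column C). The other individuals in group II are: 4. For each of the others, we see that for individual 4 the closest observation is on day 20 and 20 - 5 > 5, so we cannot use their observation in the denominator.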
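The two cases above can be checked with a small base R sketch. The helper name `window_denominator` is hypothetical and not part of the original post; it assumes that an observation counts when it belongs to the same group and its day x satisfies 0 <= x - a <= N, and that if an individual has several observations inside the window only the nearest one is used.

```r
# Hypothetical helper (an illustration, not the asker's or answerers' code):
# for each row, sum the values of same-group observations whose day x
# satisfies 0 <= x - a <= N, using only the nearest observation per individual.
window_denominator <- function(day, id, grp, val, N) {
  vapply(seq_along(day), function(i) {
    gap <- day - day[i]
    ok <- which(grp == grp[i] & gap >= 0 & gap <= N)
    # per individual, keep the row with the smallest gap
    nearest <- tapply(ok, id[ok], function(rows) rows[which.min(gap[rows])])
    sum(val[nearest])
  }, numeric(1))
}

A <- c(1, 3, 5, 20, 21, 21)
B <- c(1, 2, 3, 4, 5, 6)
C <- c("I", "I", "II", "II", "III", "III")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7)

den <- window_denominator(A, B, C, D, N = 5)
ratio <- D / den
print(data.frame(A, B, C, D, den, ratio))
```

For individual 1 the denominator is 0.7 + 0.3, and for individual 3 it is 0.5 alone, which agrees with the expected output given in the question.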

2 answers:

Answer 0 (score: 0)

I think this will give you the answer:

A <- c(1,3,5,20,21,21, 7)
B <- c(1, 2, 3, 4, 5, 6, 7)
C <- c("I","I","II","II","III","III", "I")
V <- c(0.7, 0.3, 0.5, 0.9, 4, 7, 0.1)

N=5
#Put data into a frame
test = data.frame(A,B,C,V)
#order the data by group, then by day
#(C is a character column in R >= 4.0, so sort on it directly; as.numeric(test$C) would yield NA)
test = test[order(test$C, test$A),]
#Get the 'rollback' possibilities for each value
#(upper bound is inclusive to match the question's x - a <= N)
Roll = sapply(test$A, FUN = function(x){paste(which(test$A <= (x+N) & test$A >= x), collapse=",")})
#Get the groupings
Group = sapply(test$C, FUN = function(x){paste(which(test$C == x), collapse=",")})
#Intersect the values: row positions that are both in the window and in the same group
ToGet = apply(cbind(Roll, Group), MARGIN=1, FUN=function(x){intersect(unlist(strsplit(x[1],",")), unlist(strsplit(x[2],",")))})
#Calculate the denominators
test$D = sapply(ToGet, FUN=function(x){sum(test$V[as.numeric(x)])})
test$Calc = test$V/test$D

Output:

> test
   A B   C   V    D      Calc
1  1 1   I 0.7  1.0 0.7000000
2  3 2   I 0.3  0.4 0.7500000
7  7 7   I 0.1  0.1 1.0000000
3  5 3  II 0.5  0.5 1.0000000
4 20 4  II 0.9  0.9 1.0000000
5 21 5 III 4.0 11.0 0.3636364
6 21 6 III 7.0 11.0 0.6363636

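Answer 1 (score: 0)

The question is tagged data.table, so here is a data.table solution. It uses a non-equi join to identify, within each group, the individuals that form a cohort, that is, those whose observations fall within the 5-day date window.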

library(data.table)   # CRAN version 1.10.4 used

# set length of date window in days 
N <- 5L
# give columns more semantic names according to OP's description 
setnames(M, c("day", "id", "grp", "val"))

# prepare data for non-equi join: allowable date range
ranged <- M[, .(start = day, end = day + N, co.id = id, grp)]

# non-equi join to determine cohort
joined <- M[ranged, on = c("grp", "day>=start", "day<=end")]

# compute denominator for each cohort
grouped <- joined[, .(den = sum(val)), by = co.id]

# final update on join and order
result <- M[grouped, on = c("id==co.id"), calc := val / den][order(grp, id)]

result
#   day id grp val      calc
#1:   1  1   I 0.7 0.7000000
#2:   3  2   I 0.3 0.7500000
#3:   7  7   I 0.1 1.0000000
#4:   5  3  II 0.5 1.0000000
#5:  20  4  II 0.9 1.0000000
#6:  21  5 III 4.0 0.3636364
#7:  21  6 III 7.0 0.6363636
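
To make the non-equi join less opaque, here is a base R sketch of the same idea, written as an equi-join on the group followed by a filter on the date window. It is only an illustration under the assumption that each individual has a single observation (as in the sample data); the names `obs`, `pairs`, `den`, and `calc` are chosen here and do not come from the answer.

```r
day <- c(1, 3, 5, 20, 21, 21, 7)
id  <- c(1, 2, 3, 4, 5, 6, 7)
grp <- c("I", "I", "II", "II", "III", "III", "I")
val <- c(0.7, 0.3, 0.5, 0.9, 4, 7, 0.1)
N <- 5

obs <- data.frame(day, id, grp, val)
# the date window each individual (co.id) is allowed to draw from
ranged <- data.frame(start = day, end = day + N, co.id = id, grp)

# join every observation to every window of the same group,
# then keep only the observations that fall inside [start, end]
pairs <- merge(obs, ranged, by = "grp")
pairs <- pairs[pairs$day >= pairs$start & pairs$day <= pairs$end, ]

# denominator per cohort owner, then the ratio for each individual
den <- tapply(pairs$val, pairs$co.id, sum)
calc <- val / as.numeric(den[as.character(id)])
print(data.frame(day, id, grp, val, calc))
```

The merge materialises every within-group pair before filtering, so it grows quadratically with group size; the non-equi join in the answer avoids building that intermediate table.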

Data

A <- c(1,3,5,20,21,21, 7)
B <- c(1, 2, 3, 4, 5, 6, 7)
C <- c("I","I","II","II","III","III", "I")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7, 0.1)
M <- data.table(A,B,C,D)

Compact version

For those who prefer compact code, here is a more convoluted version:

joined <- M[M[, .(start = day, end = day + N, co.id = id, grp)], 
            on = c("grp", "day>=start", "day<=end")]
M[joined[, .(den = sum(val)), by = co.id], on = c("id==co.id"), 
            calc := val / den][order(grp, id)]

Or, as a "one-liner":

M[M[M[, .(start = day, end = day + N, co.id = id, grp)], 
    on = c("grp", "day>=start", "day<=end")
    ][, .(den = sum(val)), co.id], 
  on = c("id==co.id"), calc := val / den][order(grp, id)]