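I have a dataset of areas and the scores for those areas.
I want to maintain an aggregate score (agg_score) equal to the sum of the most recent scores for A, B and C.
For example, in my expected_output you will see that row 4 is 7, because C's value is now 2 while the most recent values for A and B are still 1 and 4.
So far, all I have managed is to add up the three most recent scores, which yields agg_score values that are sometimes the sum of C, C and B. An accurate agg_score is needed at every possible date.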
library(dplyr)

ds <-
  tibble(
    area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
    score = c(1, 4, 5, 2, 6, 3, 4, 6, 3),
    scoring_date =
      seq.Date(
        from = as.Date("2019-01-01"),
        to = as.Date("2019-01-09"),
        by = "days"
      ),
    expected_output = c(NA, NA, 10, 7, 9, 11, 12, 12, 13)
  ) %>%
  arrange(scoring_date)

# Inadequate code for summing last three scores
ds %>%
  mutate(agg_score = score + lag(score) + lag(score, 2))
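To make the problem concrete, here is a minimal base R sketch of the same rolling sum (the vector names `lag1`, `lag2` and `rolling` are only illustrative, and the lags are built by hand instead of with dplyr). It shows that row 4 adds up the scores of C, C and B rather than the latest score of each area:

```r
score <- c(1, 4, 5, 2, 6, 3, 4, 6, 3)
n <- length(score)

# Hand-rolled equivalents of lag(score) and lag(score, 2)
lag1 <- c(NA, score[-n])
lag2 <- c(NA, NA, score[-c(n - 1, n)])

rolling <- score + lag1 + lag2

# Row 4 is 2 (C) + 5 (C) + 4 (B) = 11, whereas the expected value is
# 2 (C) + 4 (B) + 1 (A) = 7
rolling[4]
```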
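Answer 0 (score: 2)
There may be a data.table self-join option, but I can't quite figure it out. Here is an idea that implements a forward fill in data.table. It should be flexible enough to handle more "areas":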
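library(data.table)
setDT(ds) # := needs a data.table; the question's ds is a tibble
# Add one forward-filled helper column per area
for (a in unique(ds$area)) {
  ds[, (paste0("val_", a)) := zoo::na.locf0(ifelse(area == a, score, NA))]
}
# Sum the helper columns row-wise, then drop them
ds[, agg_score := rowSums(.SD), .SDcols = paste0("val_", unique(ds$area))][
  , (paste0("val_", unique(ds$area))) := NULL]
ds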
#  area score scoring_date agg_score
#1    A     1   2019-01-01        NA
#2    B     4   2019-01-02        NA
#3    C     5   2019-01-03        10
#4    C     2   2019-01-04         7
#5    B     6   2019-01-05         9
#6    A     3   2019-01-06        11
#7    A     4   2019-01-07        12
#8    B     6   2019-01-08        12
#9    C     3   2019-01-09        13
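To see what the fill step produces before the row sums are taken, the following sketch builds the same helper columns as a plain matrix. It assumes the zoo package is installed; the names `filled` and `agg_score` are illustrative only.

```r
area  <- c("A", "B", "C", "C", "B", "A", "A", "B", "C")
score <- c(1, 4, 5, 2, 6, 3, 4, 6, 3)

# One column per area: the score on rows belonging to that area, NA elsewhere,
# then carried forward with zoo::na.locf0 (leading NAs are left in place)
filled <- sapply(unique(area), function(a) {
  zoo::na.locf0(ifelse(area == a, score, NA))
})
filled

# A row sum is NA until every area has been scored at least once
agg_score <- rowSums(filled)
agg_score
```

Each column holds the latest known score of one area at every row, so the row sum is the aggregate the question asks for.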
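Original solution:
Alternatively, you can try sapply. The function is a bit long, but that is because there is a lot of work to do! If you want to do this for more areas, you don't have to fill each area manually, which may be an advantage: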
ds$agg_score <- sapply(seq_len(nrow(ds)), function(i) {
  other_areas <- setdiff(unique(ds$area), ds$area[i])
  # Locate rows of other areas that come before the current index
  f_idxs <- Filter(function(x) x < i, which(ds$area %in% other_areas))
  if (length(f_idxs) == 0) return(NA)
  # Split by area so we can get each area's latest index before our date
  idxs <- sapply(split(f_idxs, ds$area[f_idxs]), max)
  if (length(idxs) < length(other_areas)) return(NA)
  # Sum the latest score of every other area plus the current score
  sum(ds$score[c(idxs, i)])
})
Answer 1 (score: 2)
Using dplyr::last, we can find the last "most recent" value for each area and then sum them once the length reaches 3.
library(dplyr)
# Small helper: sum the latest score of each area among the first x rows
sum_fun <- function(x) {
  lc_vec <- ds[1:x, ] %>% group_by(area) %>% summarise(lc = last(score)) %>% pull(lc)
  # Return NA until all three areas have been scored at least once
  if (length(lc_vec) == 3) sum(lc_vec) else NA
}
ds %>% mutate(Output = sapply(1:nrow(.), sum_fun)) # purrr::map_dbl can be used instead of sapply
# A tibble: 9 x 5
  area  score scoring_date expected_output Output
  <chr> <dbl> <date>                 <dbl>  <dbl>
1 A        1. 2019-01-01               NA     NA
2 B        4. 2019-01-02               NA     NA
3 C        5. 2019-01-03              10.    10.
4 C        2. 2019-01-04               7.     7.
5 B        6. 2019-01-05               9.     9.
6 A        3. 2019-01-06              11.    11.
7 A        4. 2019-01-07              12.    12.
8 B        6. 2019-01-08              12.    12.
9 C        3. 2019-01-09              13.    13.
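Answer 2 (score: 1)
So I found a way to do this using fill(), which ensures that the most recent value is always carried forward until it is replaced by a newer one.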
library(tidyr)
ds %>%
  select(area, score, scoring_date) %>%
  spread(area, score) %>%
  fill(A, .direction = "down") %>%
  fill(B, .direction = "down") %>%
  fill(C, .direction = "down") %>%
  rowwise() %>%
  mutate(agg_score = sum(A, B, C))
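The same reshape-then-fill idea can be sketched with pivot_wider(), which replaces spread() in newer tidyr releases. This is a variant under the assumption that tidyr 1.0.0 or later is installed, not the answer's original code; the name `wide` is illustrative.

```r
library(dplyr)
library(tidyr)

ds <- tibble(
  area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
  score = c(1, 4, 5, 2, 6, 3, 4, 6, 3),
  scoring_date = seq.Date(as.Date("2019-01-01"), as.Date("2019-01-09"), by = "days")
)

wide <- ds %>%
  # One column per area, NA on dates where that area was not scored
  pivot_wider(names_from = area, values_from = score) %>%
  arrange(scoring_date) %>%
  # Carry each area's latest score forward until it is replaced
  fill(A, B, C, .direction = "down") %>%
  # A + B + C stays NA until every area has a score
  mutate(agg_score = A + B + C)

wide
```

Note that the result is in wide form, with one row per date and one column per area, so the original area and score columns are no longer present.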
Answer 3 (score: 0)
nuevoDs <- ds %>% arrange(desc(scoring_date)) %>% as.data.frame
# getting length of dataframe
longitud <- nrow(nuevoDs)
# we will iterate on each value up until (longitud - 2) and save results to a vector
elVector <- vector()
for (i in 1:(longitud - 2)) {
  elVector[i] <- nuevoDs[i, "score"] + nuevoDs[i + 1, "score"] + nuevoDs[i + 2, "score"]
}
# before cbinding we need to make the vector the same length as your dataFrame
elVector[longitud - 1] <- 0
elVector[longitud] <- 0
elVector
cbind(nuevoDs, elVector)
  area score scoring_date elVector
1    C     3   2019-01-09       13
2    B     6   2019-01-08       13
3    A     4   2019-01-07       13
4    A     3   2019-01-06       11
5    B     6   2019-01-05       13
6    C     2   2019-01-04       11
7    C     5   2019-01-03       10
8    B     4   2019-01-02        0
9    A     1   2019-01-01        0
Answer 4 (score: 0)
Another possible data.table approach.
ds[, output :=
  ds[,
    ds[.(area = unique(area), scd = .BY$scoring_date),
      sum(score),
      on = .(area = area, scoring_date <= scd),
      mult = "last"],
    by = .(area, scoring_date)]$V1
]
Output:
   area score scoring_date output
1:    A     1   2019-01-01     NA
2:    B     4   2019-01-02     NA
3:    C     5   2019-01-03     10
4:    C     2   2019-01-04      7
5:    B     6   2019-01-05      9
6:    A     3   2019-01-06     11
7:    A     4   2019-01-07     12
8:    B     6   2019-01-08     12
9:    C     3   2019-01-09     13
Data:
library(data.table)
ds <- data.table(
area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
score = c(1,4,5,2,6,3,4,6,3),
scoring_date = seq.Date(from = as.Date("2019-01-01"), to = as.Date("2019-01-09"), by = "days"))
Explanation:
The gist of the code above is:
ds[.(area = unique(area), scd = .BY$scoring_date),
  sum(score),
  on = .(area = area, scoring_date <= scd),
  mult = "last"]
This means that for each date (scd=.BY$scoring_date), we perform a non-equi self-join to find the latest (mult="last") score for every area (area=unique(area)).
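To illustrate the inner join on its own, here is a sketch for a single date, assuming the data.table package is installed and that ds is sorted by scoring_date so that the last matching row is the most recent one. The names `lookup` and `latest` are hypothetical and do not appear in the answer.

```r
library(data.table)

ds <- data.table(
  area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
  score = c(1, 4, 5, 2, 6, 3, 4, 6, 3),
  scoring_date = seq.Date(from = as.Date("2019-01-01"), to = as.Date("2019-01-09"), by = "days")
)

# One lookup row per area, all asking about the same cut-off date
lookup <- data.table(area = unique(ds$area), scd = as.Date("2019-01-04"))

# Non-equi join: same area, scoring_date on or before the cut-off;
# mult = "last" keeps only the last matching row of ds for each lookup row
latest <- ds[lookup, on = .(area, scoring_date <= scd), mult = "last"]
latest

# Sum of the latest score per area as of the cut-off date; the answer's
# output table lists 7 for 2019-01-04
sum(latest$score)
```

For a cut-off date on which some area has no score yet, that area's lookup row has no match, its score comes back as NA, and the sum is NA, which is how the first two rows of the output end up as NA.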