Aggregating the most recent scores across 3 distinct areas

Time: 2019-02-06 06:09:40

Tags: r

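I have a dataset of areas and scores for those areas.

I want to maintain an aggregate score (agg_score) equal to the sum of the most recent scores for A, B and C.

For example, you'll see in my expected_output that row 4 is 7, because C's value is now 2 while the most recent values for A and B are still 1 and 4.

So far, all I've been able to do is sum the three most recent scores, which gives agg_score values that are sometimes the sum of C, C and B. I need an accurate agg_score at every possible date.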

library(dplyr)

ds <- 
  tibble(
    area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
    score = c(1,4,5,2,6,3,4,6,3),
    scoring_date = 
      seq.Date(
        from = as.Date("2019-01-01"), 
        to = as.Date("2019-01-09"), 
        by = "days"
      ),
    expected_output = c(NA, NA, 10, 7, 9, 11, 12, 12, 13)
  ) %>%
  arrange(scoring_date)

# Inadequate code for summing last three scores
ds %>% 
  mutate(agg_score = score + lag(score) + lag(score, 2))

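5 Answers:

Answer 0: (score: 2)

There may be a data.table self-join option, but I couldn't quite work it out. Here's an idea that implements the fill in data.table. It should be flexible for more areas: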

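library(data.table)

setDT(ds)  # convert in place; := only works on a data.table

# For each area, carry its latest score forward into a helper column val_<area>
for (a in unique(ds$area)) {
  ds[, paste0("val_", a) := zoo::na.locf0(ifelse(area == a, score, NA))]
}
# Sum the helper columns row-wise (NA until every area has a score), then drop them
ds[, agg_score := rowSums(.SD), .SDcols = paste0("val_", unique(ds$area))][, paste0("val_", unique(ds$area)) := NULL]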

ds
#  area score scoring_date agg_score
#1    A     1   2019-01-01        NA
#2    B     4   2019-01-02        NA
#3    C     5   2019-01-03        10
#4    C     2   2019-01-04         7
#5    B     6   2019-01-05         9
#6    A     3   2019-01-06        11
#7    A     4   2019-01-07        12
#8    B     6   2019-01-08        12
#9    C     3   2019-01-09        13
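
To make the fill step concrete, here is a minimal sketch of what one helper column looks like before it is summed. It rebuilds the question's area and score vectors by hand, and the name val_C is only illustrative, mirroring the val_ prefix used above:

```r
library(zoo)

area  <- c("A", "B", "C", "C", "B", "A", "A", "B", "C")
score <- c(1, 4, 5, 2, 6, 3, 4, 6, 3)

# Keep only area C's scores, then carry the last observation forward.
# na.locf0() keeps leading NAs, so the result has the same length as the input.
val_C <- na.locf0(ifelse(area == "C", score, NA))
val_C
# NA NA  5  2  2  2  2  2  3
```

The leading NAs are what make agg_score NA on the first two rows: rowSums() returns NA as long as any area has not been scored yet.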

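Original solution:

Alternatively, you could try sapply. The function is a bit long, but that's because there's a lot of work to do! If you want to do this for more areas, you don't have to fill each area by hand, which may be an advantage: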

ds$agg_score <- sapply(seq_len(nrow(ds)), function(i) {
  other_areas <- setdiff(unique(ds$area), ds$area[i])
  # Locate rows from other areas that come before the current index
  f_idxs <- Filter(function(x) x < i, which(ds$area %in% other_areas))
  if (length(f_idxs) == 0) return(NA)
  # Split by area so we can get the latest index before our date for each one
  idxs <- sapply(split(f_idxs, ds$area[f_idxs]), max)
  if (length(idxs) < length(other_areas)) return(NA)
  sum(ds$score[c(idxs, i)])  # Sum up our scores, including the current row
})

Answer 1: (score: 2)

Using dplyr::last, we can find the last, i.e. most recent, value for each area, and then sum them once all 3 areas are present.

library(dplyr)

# Small helper: sum the latest score per area over rows 1..x;
# returns NA until all 3 areas have appeared
sum_fun <- function(x) {
  lc_vec <- ds[1:x, ] %>% group_by(area) %>% summarise(lc = last(score)) %>% pull(lc)
  if (length(lc_vec) == 3) sum(lc_vec) else NA
}

ds %>% mutate(Output = sapply(1:nrow(.), sum_fun)) # Instead of sapply we can use purrr::map_dbl

# A tibble: 9 x 5
area  score scoring_date expected_output Output
<chr> <dbl> <date>                 <dbl>  <dbl>
1 A        1. 2019-01-01               NA     NA 
2 B        4. 2019-01-02               NA     NA 
3 C        5. 2019-01-03               10.    10.
4 C        2. 2019-01-04                7.     7.
5 B        6. 2019-01-05                9.     9.
6 A        3. 2019-01-06               11.    11.
7 A        4. 2019-01-07               12.    12.
8 B        6. 2019-01-08               12.    12.
9 C        3. 2019-01-09               13.    13.

Answer 2: (score: 1)

So I found a way to do this with fill(), which makes sure the most recent value is always kept until it is replaced by a newer one.

library(dplyr)
library(tidyr)

ds %>% 
  select(area, score, scoring_date) %>% 
  spread(area, score) %>% 
  fill(A, B, C, .direction = "down") %>% 
  rowwise() %>% 
  mutate(agg_score = sum(A, B, C))
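
If the set of areas is not known in advance, the same spread-and-fill idea can be written without naming A, B and C. This is only a sketch, assuming tidyr >= 1.0.0 (for pivot_wider) and dplyr >= 1.0.0 (for across); the names areas and res are illustrative, and ds is rebuilt from the question's data:

```r
library(dplyr)
library(tidyr)

ds <- tibble(
  area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
  score = c(1, 4, 5, 2, 6, 3, 4, 6, 3),
  scoring_date = seq.Date(as.Date("2019-01-01"), as.Date("2019-01-09"), by = "days")
)

areas <- unique(ds$area)  # column names created by pivot_wider below

res <- ds %>%
  pivot_wider(names_from = area, values_from = score) %>%  # one column per area
  arrange(scoring_date) %>%
  fill(all_of(areas), .direction = "down") %>%              # carry latest score forward
  mutate(agg_score = rowSums(across(all_of(areas))))        # NA until every area is scored

res$agg_score
# NA NA 10  7  9 11 12 12 13
```

rowSums() also avoids the rowwise() step, which can be slow on larger data.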

Answer 3: (score: 0)

nuevoDs<-ds %>% arrange(desc(scoring_date)) %>% as.data.frame
#getting length of dataframe
longitud<-nrow(nuevoDs)
#we will iterate on each value up until (longitud - 2) and save results to a vector
elVector <- vector()

for(i in 1:(longitud-2))
{
  elVector[i] <- nuevoDs[i,"score"] + nuevoDs[i+1,"score"] + nuevoDs[i+2,"score"]

}

#before cbinding we need to make the vector the same length as your dataFrame

elVector[longitud-1] <- 0
elVector[longitud] <- 0

elVector

cbind(nuevoDs,elVector)




 area score scoring_date elVector
1    C     3   2019-01-09       13
2    B     6   2019-01-08       13
3    A     4   2019-01-07       13
4    A     3   2019-01-06       11
5    B     6   2019-01-05       13
6    C     2   2019-01-04       11
7    C     5   2019-01-03       10
8    B     4   2019-01-02        0
9    A     1   2019-01-01        0

Answer 4: (score: 0)

Another possible data.table approach.

ds[, output := 
        ds[, 
            ds[.(area=unique(area), scd=.BY$scoring_date), 
                sum(score), 
                on=.(area=area, scoring_date<=scd), 
                mult="last"], 
            by=.(area, scoring_date)]$V1
    ]

Output:

   area score scoring_date output
1:    A     1   2019-01-01     NA
2:    B     4   2019-01-02     NA
3:    C     5   2019-01-03     10
4:    C     2   2019-01-04      7
5:    B     6   2019-01-05      9
6:    A     3   2019-01-06     11
7:    A     4   2019-01-07     12
8:    B     6   2019-01-08     12
9:    C     3   2019-01-09     13

Data:

library(data.table)
ds <- data.table(
    area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
    score = c(1,4,5,2,6,3,4,6,3),
    scoring_date = seq.Date(from = as.Date("2019-01-01"), to = as.Date("2019-01-09"), by = "days"))

Explanation:

The gist of the code above is:

ds[.(area=unique(area), scd=.BY$scoring_date), 
    sum(score), 
    on=.(area=area, scoring_date<=scd), 
    mult="last"]

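This means that for each date (scd=.BY$scoring_date), we perform a non-equi self-join to find the most recent (mult="last") score for every area (area=unique(area)). Because ds is sorted by scoring_date, the last matching row for an area is its latest score on or before that date.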
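
To see what the inner join returns, it can help to run it for one date outside the by= loop. The sketch below rebuilds ds from the data above; one_date and latest are illustrative names, and the date is picked to match row 4 of the expected output:

```r
library(data.table)

ds <- data.table(
  area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
  score = c(1, 4, 5, 2, 6, 3, 4, 6, 3),
  scoring_date = seq.Date(from = as.Date("2019-01-01"), to = as.Date("2019-01-09"), by = "days"))

one_date <- as.Date("2019-01-04")  # illustrative date, stands in for .BY$scoring_date

# One lookup row per area; for each, keep the last row of ds dated on or before one_date
latest <- ds[.(area = unique(ds$area), scd = one_date),
             on = .(area, scoring_date <= scd),
             mult = "last"]

latest$score       # 1 4 2  (A, B, C)
sum(latest$score)  # 7

# On an early date, area C has no match yet, so its score is NA and so is the sum
early <- ds[.(area = unique(ds$area), scd = as.Date("2019-01-02")),
            on = .(area, scoring_date <= scd),
            mult = "last"]
sum(early$score)   # NA
```

The NA for unmatched areas comes from the default nomatch = NA, which is why the first two rows of the output are NA without any extra handling.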