Merging overlapping intervals in R

Date: 2019-02-01 03:17:04

Tags: r merge dplyr intervals

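I am trying to merge overlapping intervals so that I can compute the sum of the unique intervals, while removing excluded intervals.

Here is a minimal working example: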

mydata <- data.frame(interval = c(1,2,3,4,5,6,7,8,9,10),
                     timeoutStart = c(280,500,NA,NA,NA,NA,NA,NA,NA,NA),
                     timeoutEnd = c(310,530,NA,NA,NA,NA,NA,NA,NA,NA),
                     cheeringStart = c(1,181,205,330,460,740,NA,NA,NA,NA),
                     cheeringEnd = c(120,199,300,420,475,760,NA,NA,NA,NA),
                     possessionStart = c(80,180,210,250,350,450,550,650,750,800),
                     possessionEnd = c(130,200,220,280,400,499,600,700,800,950)
)

interval timeoutStart timeoutEnd cheeringStart cheeringEnd possessionStart possessionEnd
       1          280        310             1         120              80           130
       2          500        530           181         199             180           200
       3           NA         NA           205         300             210           220
       4           NA         NA           330         420             250           280
       5           NA         NA           460         475             350           400
       6           NA         NA           740         760             450           499
       7           NA         NA            NA          NA             550           600
       8           NA         NA            NA          NA             650           700
       9           NA         NA            NA          NA             750           800
      10           NA         NA            NA          NA             800           950

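In the minimal working example above, I want to calculate the total time the team spent cheering or in possession of the ball, excluding timeouts. The values in the data frame are the start and end times (in seconds elapsed since the start of the game) of the different intervals for each outcome (timeout, cheering, possession). The outcomes are not mutually exclusive and can occur at the same time. However, I do not want to double-count overlapping cheering and possession intervals. That is, I want to merge the overlapping cheering and possession intervals so that I can sum over the "unique" intervals.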

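For example, one cheering interval runs from 740 to 760 seconds, and a possession interval overlaps it (750 to 800 seconds). The merged interval is 740 to 800 seconds (duration = 60 seconds).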
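To make the merging step concrete, here is a minimal base R sketch. The helper name merge_intervals is hypothetical (it does not come from any package), and the sketch assumes that intervals which merely touch should be merged as well.

```r
# Hypothetical helper: merge overlapping (or touching) intervals.
# Input does not need to be sorted; NA rows are dropped.
merge_intervals <- function(start, end) {
  ok <- !is.na(start) & !is.na(end)
  start <- start[ok]
  end <- end[ok]
  if (length(start) == 0) {
    return(data.frame(start = numeric(0), end = numeric(0)))
  }
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  # A new group begins whenever a start lies beyond every earlier end.
  prev_max_end <- c(-Inf, head(cummax(end), -1))
  group <- cumsum(start > prev_max_end)
  data.frame(start = as.numeric(tapply(start, group, min)),
             end   = as.numeric(tapply(end, group, max)))
}

# The cheering (740-760) and possession (750-800) intervals from the example
merged <- merge_intervals(c(740, 750), c(760, 800))
merged                          # one row: start = 740, end = 800
sum(merged$end - merged$start)  # 60
```

Sorting by start and tracking the running maximum of the end times is what removes the need for pre-sorted columns.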

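After merging the overlapping cheering and possession intervals, I want to exclude the timeout intervals. For example, from the unique interval running from 205 to 300 seconds, I want to exclude the timeout interval from 280 to 310 seconds. The unique interval excluding the timeout is therefore 205 to 280 seconds (duration = 75 seconds).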
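The exclusion step can be sketched as subtracting the length of the overlap with each timeout. The helper overlap_length is a hypothetical name, and the sketch assumes the timeout intervals do not overlap one another (otherwise the overlap would be subtracted twice).

```r
# Hypothetical helper: length of the overlap between [s1, e1] and [s2, e2],
# vectorised over s2/e2; zero when the intervals do not intersect.
overlap_length <- function(s1, e1, s2, e2) {
  pmax(0, pmin(e1, e2) - pmax(s1, s2))
}

timeout_start <- c(280, 500)
timeout_end   <- c(310, 530)

# Unique interval from the example: 205 to 300
excluded <- sum(overlap_length(205, 300, timeout_start, timeout_end))
duration <- (300 - 205) - excluded
excluded  # 20
duration  # 75
```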

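I want to calculate the duration of each unique interval (End - Start), excluding timeout intervals, and then sum those durations over all unique intervals. Finally, I would like to be able to include or exclude an interval from the calculation based on the value of another variable in the same row (keep = 0 or 1).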

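Assume the Start and End time columns are not pre-sorted. I would also like the approach to be general, so that additional sets of columns (e.g., dribbling, passing) can easily be added to the sum. I have looked at other answers but could not find a way to generalize their solutions to my case.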
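One way the generalized calculation could look is sketched below in base R. It relies on the identity that the time covered by the included intervals but not by timeouts equals the length of the union of (included + timeout) intervals minus the length of the union of the timeout intervals alone. The names union_length and total_time are hypothetical, the columns are assumed to follow the <prefix>Start / <prefix>End naming of the example, and the keep flag is assumed to filter only the included intervals (timeouts always apply).

```r
mydata <- data.frame(
  interval        = 1:10,
  timeoutStart    = c(280, 500, rep(NA, 8)),
  timeoutEnd      = c(310, 530, rep(NA, 8)),
  cheeringStart   = c(1, 181, 205, 330, 460, 740, rep(NA, 4)),
  cheeringEnd     = c(120, 199, 300, 420, 475, 760, rep(NA, 4)),
  possessionStart = c(80, 180, 210, 250, 350, 450, 550, 650, 750, 800),
  possessionEnd   = c(130, 200, 220, 280, 400, 499, 600, 700, 800, 950),
  keep            = c(0, 0, rep(1, 8))  # illustrative values, not from the question
)

# Hypothetical helper: total length covered by a set of possibly
# overlapping, unsorted intervals (NA rows are ignored).
union_length <- function(start, end) {
  ok <- !is.na(start) & !is.na(end)
  if (!any(ok)) return(0)
  ord <- order(start[ok])
  s <- start[ok][ord]
  e <- end[ok][ord]
  # Count only the part of each interval not covered by earlier ones.
  covered_to <- c(-Inf, head(cummax(e), -1))
  sum(pmax(0, e - pmax(s, covered_to)))
}

# Hypothetical helper: time covered by the `include` column sets,
# minus any time that falls inside the `exclude` column set.
total_time <- function(data, include, exclude = "timeout", keep = NULL) {
  rows <- if (is.null(keep)) {
    rep(TRUE, nrow(data))
  } else {
    as.logical(data[[keep]]) %in% TRUE
  }
  inc_start <- unlist(lapply(include, function(p) data[[paste0(p, "Start")]][rows]))
  inc_end   <- unlist(lapply(include, function(p) data[[paste0(p, "End")]][rows]))
  exc_start <- data[[paste0(exclude, "Start")]]
  exc_end   <- data[[paste0(exclude, "End")]]
  union_length(c(inc_start, exc_start), c(inc_end, exc_end)) -
    union_length(exc_start, exc_end)
}

total_time(mydata, include = c("cheering", "possession"))                 # 673
total_time(mydata, include = c("cheering", "possession"), keep = "keep")  # 524
```

Adding another outcome would only require extending the include vector (for example with "dribbling"), provided the matching dribblingStart and dribblingEnd columns exist.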

2 answers:

Answer 0 (score: 1)

Here is a solution that uses foverlaps() from data.table to perform an overlap join. It is only a partial solution (providing the desired output would have helped), but you can probably build on this code to get what you need.

Assuming your data is named df:

library( data.table )

#create data.tables for cheers and possession
cheers.dt <- data.table( interval.cheer = df$interval,
                         start.cheer = df$cheeringStart,
                         end.cheer = df$cheeringEnd )[!is.na(start.cheer),]
possession.dt <- data.table( interval.pos = df$interval,
                             start.pos = df$possessionStart,
                             end.pos = df$possessionEnd )

#set keys
setkey( cheers.dt, start.cheer, end.cheer )

#perform overlap-join
foverlaps( possession.dt,
           cheers.dt,
           by.x = c( "start.pos", "end.pos" ),
           type = "any",
           mult = "all",
           nomatch = NULL )

#    interval.cheer start.cheer end.cheer interval.pos start.pos end.pos
# 1:              1           1       120            1        80     130
# 2:              2         181       199            2       180     200
# 3:              3         205       300            3       210     220
# 4:              3         205       300            4       250     280
# 5:              4         330       420            5       350     400
# 6:              5         460       475            6       450     499
# 7:              6         740       760            9       750     800

I suggest reading up on data.table's foverlaps() function and on non-equi joins.

Answer 1 (score: 1)

How about this?

mydata <- data.frame(interval = c(1,2,3,4,5,6,7,8,9,10),
                     timeoutStart = c(280,500,NA,NA,NA,NA,NA,NA,NA,NA),
                     timeoutEnd = c(310,530,NA,NA,NA,NA,NA,NA,NA,NA),
                     cheeringStart = c(1,181,205,330,460,740,NA,NA,NA,NA),
                     cheeringEnd = c(120,199,300,420,475,760,NA,NA,NA,NA),
                     possessionStart = c(80,180,210,250,350,450,550,650,750,800),
                     possessionEnd = c(130,200,220,280,400,499,600,700,800,950),
                     keep = c(rep(FALSE, 2), rep(TRUE, 8)) #added for illustration
)

#add whatever columns you want to use to calculate the merged interval
#they must be in the same order in both vectors
#e.g. if 'cheeringStart' is at index 1, so must 'cheeringEnd'
intervalStartCols <- c('cheeringStart', 'possessionStart')
intervalEndCols <- c('cheeringEnd', 'possessionEnd')
intervalCols <- c(intervalStartCols, intervalEndCols)
timeoutCols <- c('timeoutStart', 'timeoutEnd')

mydata$mergedDuration <- apply(mydata, MARGIN = 1, FUN = function(row){

  #return zero if all NAs
  if(all(is.na(row[intervalCols]))) return(0)

  if(!all(is.na(row[timeoutCols]))){
    timeout.start <- row['timeoutStart']
    timeout.end <- row['timeoutEnd']
  } else {
    timeout.end <- 0
  }

  #identify the maximum time (this will be the end of the merged interval)
  max.end <- max(row[intervalEndCols], na.rm=TRUE)

  #set initial values
  duration <- 0
  segment.complete <- FALSE
  start.i <- which(row[intervalStartCols] == min(row[intervalStartCols], na.rm=TRUE))
  next.step <- row[intervalStartCols][start.i]

  waypoints <- row[intervalCols]
  waypoints <- waypoints[!is.na(waypoints)]
  waypoints <- waypoints[waypoints!=next.step]

  #calculate interval duration adjusting for overlap
  while(next.step < max.end){

    start <- row[intervalStartCols][start.i]

    next.step <- waypoints[waypoints == min(waypoints[waypoints!=next.step])]
    if(segment.complete){
      start.i <- which(row[intervalStartCols] == next.step)
      segment.complete <- FALSE
    }
    end.i <- which(row[intervalEndCols] == next.step)

    waypoints <- waypoints[waypoints!=next.step]

    if(length(end.i) > 0 && length(start.i) >0 && end.i == start.i) {

      segment.start <- row[intervalStartCols][start.i]
      segment.end <- row[intervalEndCols][end.i]
      segment.duration <- segment.end - segment.start

      #adjust for timeout: subtract only the part of the timeout that overlaps the segment
      timeout.adj <- {
        if (timeout.end == 0) 0 #this is the NA case
        else max(0, min(segment.end, timeout.end) - max(segment.start, timeout.start))
      }

      duration <- duration + segment.duration - timeout.adj
      segment.complete <- TRUE
    }

  }

  duration
})

#sum duration using 'keep' column as mask
summed.duration <- sum(mydata[mydata$keep, 'mergedDuration'])
print(summed.duration)