如何丢弃行间差异小于特定值的观察值

时间:2019-03-08 11:03:30

标签: r data.table

我有一个data.table,它由几个组组成(更具体的分层面板/经度数据集),并且组中的一个单元格看起来像这样

z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), 
            t = as.Date(c(27, 32:34, 36:41))) 
# that is:
#        x          t
#  1: 10.0 1970-01-28
#  2: 10.5 1970-02-02
#  3: 11.1 1970-02-03
#  4: 14.0 1970-02-04
#  5: 14.2 1970-02-06 # to be removed since 14.2-14.0 = 0.2 <0.5
#  6: 14.4 1970-02-07 # to be removed since 14.4-14.2 = 0.2 <0.5 and 14.4-14.0 = 0.4 <0.5
#  7: 14.6 1970-02-08 # shall NOT be removed because 14.6-14.0 = 0.6 > 0.5
#  8: 17.0 1970-02-09
#  9: 17.4 1970-02-10 # to be removed
# 10: 30.0 1970-02-11

为简单起见,组被排除在外,因此假设数据中只有两个变量(列):

我需要删除行之间差异小于0.5的观测值,所以我需要这样

#        x          t
#  1: 10.0 1970-01-31
#  2: 10.5 1970-02-02
#  3: 11.1 1970-02-03
#  4: 14.0 1970-02-04
#  7: 14.6 1970-02-08
#  8: 17.0 1970-02-09
# 10: 30.0 1970-02-11

最后,它满足邻居中的任何两个值在变量t的顺序上相差不小于0.5。

这样的data.table是否可能,但是更大,有几个组和近1亿个观测值。

谢谢你!

3 个答案:

答案 0 :(得分:2)

如果我理解正确,您可以这样做:

library(data.table)

z <- z[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
  , filt := ifelse(x == filt, 
                   shift(x, fill = x[1]), 
                   filt)][
                     x - filt >= 0.5 | x == filt, ][, filt := NULL]

说明:

  • 首先,我们计算各组x的最小值;
  • 组由cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))创建。其中,我们为每一行检查x >= shift(x) + 0.5x与上一行之间的差是否大于或等于0.5)。这等于TRUEFALSE,我们用+变成1和0;由于第一行将始终为NA(因为没有前一行),因此我们在表达式后使用[-1]将其删除。因为这意味着向量中将缺少第一个值,所以我们构造了另一个值,该值以1开头,后跟之前计算的值。之后,我们应用cumsum-每次新行大于或等于前一行+ 0.5时,后者都会分配一个值;如果它们之间没有这样的行,它将继续分配最后一个数字(因为我们在向量的开头插入了1,所以它将从1开始,并且每次遇到满足条件的行都将增加+1 (不排除);
  • 每行先前创建的组中只有1行;在这种情况下,我们需要交叉检查与上一行的确切差异。在所有其他情况下,我们会与该组的第一行进行交叉检查(即最后一行,因为它大于上一个+ 0.5,因此不应根据标准删除);
  • 在那之后,我们只删除那些不满足条件的行,然后保留等于它的行(将始终是第一行);我们最后删除了过滤变量。

输出:

      x          t
1: 10.0 1970-01-28
2: 10.5 1970-02-02
3: 11.1 1970-02-03
4: 14.0 1970-02-04
5: 14.6 1970-02-08
6: 17.0 1970-02-09
7: 30.0 1970-02-11

答案 1 :(得分:1)

由于间隙取决于行的顺序移除,因此下面的解决方案使用一种交互式方法来识别并重新计算移除行之后的后续间隙。

z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), 
                t = as.Date(c(27, 32:34, 36:41))) 
setkeyv(z,"t")

find_gaps <- function(dt) {
  dt[, last_x := shift(.SD, n=1, fill=NA, type="lag"), .SDcols="x"]
  gaps <- dt[,abs(x-last_x) < 0.5,]
  gap <- which(gaps==TRUE)[1]
  #print(paste0("Removing row: ",gap))
  return (gap)
}

while(!is.na(gap<-find_gaps(z))) { z <- z[-gap] }

z

结果:

[1] "removing row: 5"
[1] "removing row: 5"
[1] "removing row: 7"
> z
      x          t last_x   gap
1: 10.0 1970-01-28     NA FALSE
2: 10.5 1970-02-02   10.0 FALSE
3: 11.1 1970-02-03   10.5 FALSE
4: 14.0 1970-02-04   11.1 FALSE
5: 14.6 1970-02-08   14.0 FALSE
6: 17.0 1970-02-09   14.6 FALSE
7: 30.0 1970-02-11   17.0 FALSE

备用

注意到8gb文件并着眼效率:建议使用旧的for loop()作为最有效的

z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z1,"t")
x <- z1$x
last_x <- x[1]
gaps <- c()

for (i in 2:length(x))
{
  if (abs(x[i]-last_x) < 0.5) gaps <- c(gaps,i)
  else last_x <- x[i]
}
z1 <- z1[-(gaps)]

基准化

microbenchmark::microbenchmark(times=100,
  forway={
    z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z1,"t")
    x <- z1$x; last_x <- x[1];  gaps <- c()

    for (i in 2:length(x)) { if (abs(x[i]-last_x) < 0.5) { gaps <- c(gaps,i); } else { last_x <- x[i]; } }
    z1 <- z1[-(gaps)]
  },
  datatableway={
    z2 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z2,"t")

    z2 <- z2[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][, filt := ifelse(x == filt, shift(x, fill = x[1]), filt)][x - filt >= 0.5 | x == filt, ][, filt := NULL]
  },
  whileway={
    z3 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z3,"t")

    find_gaps <- function(dt) {
      dt[, last_x := shift(.SD, n=1, fill=NA, type="lag"), .SDcols="x"]
      gaps <- dt[,abs(x-last_x) < 0.5,]
      which(gaps==TRUE)[1]
    }
    while(!is.na(gap<-find_gaps(z3))) { z3 <- z3[-gap] }
  }
)

(z1==z2) & (z2==z3[,.(x,t)])

结果:

Unit: milliseconds
         expr       min        lq      mean    median        uq      max neval
       forway  2.741609  3.607341  4.067566  4.069382  4.556219  5.61997   100
 datatableway  7.552005  8.915333  9.839475  9.606205 10.762764 15.46430   100
     whileway 13.903507 19.059612 20.692397 20.577014 22.243933 27.44271   100
> 
> (z1==z2) & (z2==z3[,.(x,t)])
        x    t
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE
[4,] TRUE TRUE
[5,] TRUE TRUE
[6,] TRUE TRUE
[7,] TRUE TRUE

答案 2 :(得分:0)

您可以使用dplyr::mutatefilter

z %>%
  mutate(diff = lead(x, 1) - x) %>%
  filter(diff >= 0.5 | is.na(diff)) %>%
  select(-diff)

为了便于理解,我保留了diff字段。您也可以在单个过滤器语句中执行此操作