我有一个data.table,它由几个组组成(更具体的分层面板/经度数据集),并且组中的一个单元格看起来像这样
z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
t = as.Date(c(27, 32:34, 36:41)))
# that is:
# x t
# 1: 10.0 1970-01-28
# 2: 10.5 1970-02-02
# 3: 11.1 1970-02-03
# 4: 14.0 1970-02-04
# 5: 14.2 1970-02-06 # to be removed since 14.2-14.0 = 0.2 <0.5
# 6: 14.4 1970-02-07 # to be removed since 14.4-14.2 = 0.2 <0.5 and 14.4-14.0 = 0.4 <0.5
# 7: 14.6 1970-02-08 # shall NOT be removed because 14.6-14.0 = 0.6 > 0.5
# 8: 17.0 1970-02-09
# 9: 17.4 1970-02-10 # to be removed
# 10: 30.0 1970-02-11
为简单起见,组被排除在外,因此假设数据中只有两个变量(列):
我需要删除行之间差异小于0.5的观测值,所以我需要这样
# x t
# 1: 10.0 1970-01-31
# 2: 10.5 1970-02-02
# 3: 11.1 1970-02-03
# 4: 14.0 1970-02-04
# 7: 14.6 1970-02-08
# 8: 17.0 1970-02-09
# 10: 30.0 1970-02-11
最后,它满足邻居中的任何两个值在变量t的顺序上相差不小于0.5。
这样的data.table是否可能,但是更大,有几个组和近1亿个观测值。
谢谢你!
答案 0 :(得分:2)
如果我理解正确,您可以这样做:
library(data.table)
z <- z[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
, filt := ifelse(x == filt,
shift(x, fill = x[1]),
filt)][
x - filt >= 0.5 | x == filt, ][, filt := NULL]
说明:
x
的最小值; cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))
创建。其中,我们为每一行检查x >= shift(x) + 0.5
(x
与上一行之间的差是否大于或等于0.5)。这等于TRUE
或FALSE
,我们用+
变成1和0;由于第一行将始终为NA
(因为没有前一行),因此我们在表达式后使用[-1]
将其删除。因为这意味着向量中将缺少第一个值,所以我们构造了另一个值,该值以1开头,后跟之前计算的值。之后,我们应用cumsum
-每次新行大于或等于前一行+ 0.5时,后者都会分配一个值;如果它们之间没有这样的行,它将继续分配最后一个数字(因为我们在向量的开头插入了1,所以它将从1开始,并且每次遇到满足条件的行都将增加+1 (不排除); 输出:
x t
1: 10.0 1970-01-28
2: 10.5 1970-02-02
3: 11.1 1970-02-03
4: 14.0 1970-02-04
5: 14.6 1970-02-08
6: 17.0 1970-02-09
7: 30.0 1970-02-11
答案 1 :(得分:1)
由于间隙取决于行的顺序移除,因此下面的解决方案使用一种交互式方法来识别并重新计算移除行之后的后续间隙。
z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
t = as.Date(c(27, 32:34, 36:41)))
setkeyv(z,"t")
find_gaps <- function(dt) {
dt[, last_x := shift(.SD, n=1, fill=NA, type="lag"), .SDcols="x"]
gaps <- dt[,abs(x-last_x) < 0.5,]
gap <- which(gaps==TRUE)[1]
#print(paste0("Removing row: ",gap))
return (gap)
}
while(!is.na(gap<-find_gaps(z))) { z <- z[-gap] }
z
结果:
[1] "removing row: 5"
[1] "removing row: 5"
[1] "removing row: 7"
> z
x t last_x gap
1: 10.0 1970-01-28 NA FALSE
2: 10.5 1970-02-02 10.0 FALSE
3: 11.1 1970-02-03 10.5 FALSE
4: 14.0 1970-02-04 11.1 FALSE
5: 14.6 1970-02-08 14.0 FALSE
6: 17.0 1970-02-09 14.6 FALSE
7: 30.0 1970-02-11 17.0 FALSE
注意到8gb文件并着眼效率:建议使用旧的for loop()作为最有效的
z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z1,"t")
x <- z1$x
last_x <- x[1]
gaps <- c()
for (i in 2:length(x))
{
if (abs(x[i]-last_x) < 0.5) gaps <- c(gaps,i)
else last_x <- x[i]
}
z1 <- z1[-(gaps)]
microbenchmark::microbenchmark(times=100,
forway={
z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z1,"t")
x <- z1$x; last_x <- x[1]; gaps <- c()
for (i in 2:length(x)) { if (abs(x[i]-last_x) < 0.5) { gaps <- c(gaps,i); } else { last_x <- x[i]; } }
z1 <- z1[-(gaps)]
},
datatableway={
z2 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z2,"t")
z2 <- z2[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][, filt := ifelse(x == filt, shift(x, fill = x[1]), filt)][x - filt >= 0.5 | x == filt, ][, filt := NULL]
},
whileway={
z3 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z3,"t")
find_gaps <- function(dt) {
dt[, last_x := shift(.SD, n=1, fill=NA, type="lag"), .SDcols="x"]
gaps <- dt[,abs(x-last_x) < 0.5,]
which(gaps==TRUE)[1]
}
while(!is.na(gap<-find_gaps(z3))) { z3 <- z3[-gap] }
}
)
(z1==z2) & (z2==z3[,.(x,t)])
Unit: milliseconds
expr min lq mean median uq max neval
forway 2.741609 3.607341 4.067566 4.069382 4.556219 5.61997 100
datatableway 7.552005 8.915333 9.839475 9.606205 10.762764 15.46430 100
whileway 13.903507 19.059612 20.692397 20.577014 22.243933 27.44271 100
>
> (z1==z2) & (z2==z3[,.(x,t)])
x t
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE
[4,] TRUE TRUE
[5,] TRUE TRUE
[6,] TRUE TRUE
[7,] TRUE TRUE
答案 2 :(得分:0)
您可以使用dplyr::mutate
和filter
:
z %>%
mutate(diff = lead(x, 1) - x) %>%
filter(diff >= 0.5 | is.na(diff)) %>%
select(-diff)
为了便于理解,我保留了diff
字段。您也可以在单个过滤器语句中执行此操作