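I have a data.table. I want to treat each Id as a separate group and find the places where the difference in seconds between one row and the next exceeds 300.
At those places, new rows should be added automatically, each copying the contents of the previous row. The size of the difference in seconds between the two rows determines how many rows to add.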
library(data.table)
DT <- data.table(Id = c("A","A","A","A","A","B","B","B","B"),
                 valueA = c(479117,479119,479117,479118,479118,479118,479118,479118,479121),
                 valueB = c(209946,209948,209946,209953,209953,209953,209953,209951,209944),
                 second = c(0,745,12,5,50,938,114,339,705))
Test data frame
Id valueA valueB second
1 A 479117 209946 0
2 A 478419 209948 745
3 A 479117 209946 12
4 A 479118 209953 5
5 A 479118 209953 50
6 B 479118 209953 938
7 B 479118 209953 114
8 B 479118 209951 339
9 B 479121 209944 705
I would like the transformed data frame to look like this
Id valueA valueB second
1 A 479117 209946 0 #(original row 1)
#2 A 479117 209946 300 #(new row 2)
#3 A 479117 209946 300 #(new row 3)
4 A 478419 209948 745 #(original row 2)
5 A 479117 209946 12 #(original row 3)
6 A 479118 209953 5
7 A 479118 209953 50 #(original row 5)
# original rows 5 and 6 have different Ids, so they are not compared
8 B 479118 209953 938 #(original row 6)
9 B 479118 209953 114
10 B 479118 209951 339 #(original row 8)
#11 B 479118 209951 300 #(new row 11)
12 B 479121 209944 705 #(original row 9)
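Because the difference between original row 1 and original row 2 is 745 seconds, new rows 2 and 3 copy the contents of the previous row. Two copies are needed because 745/300 = 2.48, which rounds to 2.
The difference between original row 8 and original row 9 is 366 seconds (705 - 339), so new row 11 copies the contents of the previous row (original row 8). One copy is needed because 366/300 = 1.22, which rounds to 1.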
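The arithmetic above can be checked with a few lines of base R. This is only a sketch of the rule as I read it: `rows_to_add`, `sec_A`, and `sec_B` are names made up for illustration, and it assumes fractional results are rounded down with `floor()` (rounding to the nearest integer would give the same counts for these two examples).

```r
# Seconds for Id "A" and Id "B", copied from the test data
sec_A <- c(0, 745, 12, 5, 50)
sec_B <- c(938, 114, 339, 705)

# Number of rows to insert before each row (hypothetical helper):
# floor(gap / 300) when the gap to the previous row exceeds 300, otherwise 0
rows_to_add <- function(sec) {
  gap <- c(0, diff(sec))  # the first row of a group has no previous row
  ifelse(gap > 300, floor(gap / 300), 0)
}

rows_to_add(sec_A)  # 0 2 0 0 0  -> two new rows before original row 2
rows_to_add(sec_B)  # 0 0 0 1    -> one new row before original row 9
```

Negative gaps, such as 12 - 745, never exceed 300 and so add nothing.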
My original data has 2 million rows.
This is quite complicated to describe. Is there a way to do it?
Thank you.
Answer 0: (score: 0)
Since nobody has come up with a clever solution, here is an approach that is a bit sloppy but may work:
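library(dplyr)
library(purrr)
library(tidyr)  # provides fill()

# Build the expanded "second" column for one Id (reads the global DT)
grow_df <- function(x) {
  seconds <- DT %>%
    filter(Id == x) %>%
    pull(second)

  seconds2 <- c()
  for (i in seq_along(seconds)) {
    if (i == 1 || seconds[i] - seconds[i - 1] <= 300) {
      seconds2 <- c(seconds2, seconds[i])
    } else {
      # Insert floor(gap / 300) filler rows of 300 seconds, then the row itself
      n_new <- floor((seconds[i] - seconds[i - 1]) / 300)
      seconds2 <- c(seconds2, rep(300, n_new), seconds[i])
    }
  }
  tibble(Id = x, second = seconds2)
}

# Filler rows have no match in DT, so their values are NA until fill() copies
# the previous row's valueA and valueB down. This relies on each (Id, second)
# pair being unique in DT and on no original row having second == 300.
map(unique(DT$Id), grow_df) %>%
  bind_rows() %>%
  left_join(DT, by = c("Id", "second")) %>%
  fill(valueA, valueB) %>%
  select(Id, valueA, valueB, second)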
Note: for performance reasons, you should not "grow" a vector the way seconds2 is grown here. But it works for this example.
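For the 2 million rows mentioned in the question, a vectorized data.table version might scale better than looping per Id and growing a vector. The following is a sketch rather than a benchmarked solution. The names `work`, `n_before`, `n_after`, `times`, `is_copy`, and `out` are introduced here for illustration. It assumes that filler rows should carry second = 300, as in the desired output, and that the number of filler rows is floor(gap / 300).

```r
library(data.table)

DT <- data.table(
  Id     = c("A","A","A","A","A","B","B","B","B"),
  valueA = c(479117,479119,479117,479118,479118,479118,479118,479118,479121),
  valueB = c(209946,209948,209946,209953,209953,209953,209953,209951,209944),
  second = c(0,745,12,5,50,938,114,339,705)
)

work <- copy(DT)  # keep the original table untouched

# Filler rows needed before each row: floor(gap / 300) when gap > 300, else 0.
# The gap is taken within each Id, so rows from different Ids are never compared.
work[, n_before := {
  gap <- c(0, diff(second))
  (gap > 300) * floor(gap / 300)
}, by = Id]

# Fillers are copies of the previous row, so shift the count up by one row
work[, n_after := shift(n_before, type = "lead", fill = 0), by = Id]

# Repeat each row once for itself plus once per filler that follows it
times   <- as.integer(1 + work$n_after)
is_copy <- sequence(times) > 1L  # TRUE for the inserted copies
out     <- work[rep(seq_len(nrow(work)), times)]

out[is_copy, second := 300]               # fillers get second = 300
out[, c("n_before", "n_after") := NULL]   # drop the helper columns
out
```

On the sample data this yields 12 rows, with the `second` column reading 0, 300, 300, 745, 12, 5, 50, 938, 114, 339, 300, 705. Everything is done with whole-column operations and a single row-index lookup, so no vector is grown inside a loop.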