I have a sorted numeric input like this:
1 1 10
1 12 18
1 16 30
1 30 40
2 35 45
DF = structure(list(V1 = c(1L, 1L, 1L, 1L, 2L), V2 = c(1L, 12L, 16L,
30L, 35L), V3 = c(10L, 18L, 30L, 40L, 45L)), .Names = c("V1",
"V2", "V3"), row.names = c(NA, -5L), class = "data.frame")
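It is sorted by the first column and then by the second. I am trying to design an efficient function in R (my input has hundreds of thousands of rows) that merges overlapping rows. For example, rows 2 and 3 overlap in three positions (16, 17 and 18), and rows 3 and 4 overlap in one position (30), whereas row 5 starts with 2, so it is kept separate from the rest. In summary, I would like to get: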
1 1 10
1 12 40
2 35 45
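However, I am struggling to incorporate a parameter that says "if two rows are close enough to each other, e.g. within 5 units, merge them; otherwise don't". In that case, I would like to get: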
1 1 40
2 35 45
because 12 - 10 = 2 < 5. But if the parameter were set to 1, the output would be the same as before:
1 1 10
1 12 40
2 35 45
Answer 0: (score: 2)
One way to do it:
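library(data.table)
DT = as.data.table(DF)  # DF is the data frame from the question
th = 5                  # rows whose gap is smaller than th are merged
# within each V1, start a new group whenever the gap to the previous row's end is >= th
DT[, g := cumsum(V2 - shift(V3, fill = first(V2)) >= th), by=V1]
# collapse each group to its first start and its last end
DT[, .(V2 = first(V2), V3 = last(V3)), by=.(V1, g = rleid(V1, g))]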
# V1 g V2 V3
# 1: 1 1 1 40
# 2: 2 2 35 45
# same code with th = 1
# V1 g V2 V3
# 1: 1 1 1 10
# 2: 1 2 12 40
# 3: 2 3 35 45
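To see why the grouping step splits the rows where it does, it can help to look at the intermediate columns. The following is only a sketch on the question's sample data with th = 1; the column name gap is introduced here for illustration and is not part of the answer above.

```r
library(data.table)

# Sample data from the question
DT <- data.table(V1 = c(1L, 1L, 1L, 1L, 2L),
                 V2 = c(1L, 12L, 16L, 30L, 35L),
                 V3 = c(10L, 18L, 30L, 40L, 45L))
th <- 1L

# gap (illustrative name): distance from the previous row's end to this
# row's start; the fill value makes it 0 for the first row of each V1
DT[, gap := V2 - shift(V3, fill = first(V2)), by = V1]

# g: running count, within V1, of gaps that are at least th
DT[, g := cumsum(gap >= th), by = V1]

print(DT)
```

For V1 = 1 the gaps are 0, 2, -2, 0, so only the second row reaches the threshold of 1 and g becomes 0, 1, 1, 1. With th = 5 no gap reaches the threshold and all four rows stay in group 0.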
Answer 1: (score: 0)
This works for your toy example:
library(dplyr)

df <- data.frame(ID=c(1,1,1,1,2),
                 X1=c(1,12,16,30,35),
                 X2=c(10,18,30,40,45))
df %>%
  group_by(ID) %>% # group-wise operation by ID
  mutate(lg=lag(X2+5,default=Inf)) %>% # previous row's X2 plus the threshold of 5; Inf for the first row of each ID
  mutate(lt=lg<=X1) %>% # TRUE when this row starts at least 5 units after the previous row ended
  mutate(group=cumsum(lt)) %>% # each TRUE starts a new group within the ID
  group_by(ID,group) %>%
  summarise(X1=min(X1), X2=max(X2)) # merged range: min of X1 and max of X2 per group
Output
ID group X1 X2
1 1 0 1 40
2 2 0 35 45
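Both answers compare each row only with the row immediately before it. If an earlier row could end later than the rows that follow it (a nested range), comparing against the furthest end seen so far is safer. Below is a minimal base R sketch of that idea, assuming the column names ID, X1 and X2 from this answer; merge_close is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: merge rows (ID, X1 = start, X2 = end) whose gap to
# the furthest end seen so far within the same ID is smaller than th.
merge_close <- function(df, th) {
  df <- df[order(df$ID, df$X1), ]              # sort by ID, then start
  pieces <- lapply(split(df, df$ID), function(d) {
    run_end  <- cummax(d$X2)                   # furthest end seen so far
    prev_end <- c(d$X1[1], head(run_end, -1))  # first row gets a gap of 0
    g <- cumsum(d$X1 - prev_end >= th)         # group counter within this ID
    data.frame(ID = d$ID[1],
               X1 = as.vector(tapply(d$X1, g, min)),
               X2 = as.vector(tapply(d$X2, g, max)))
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

df <- data.frame(ID = c(1, 1, 1, 1, 2),
                 X1 = c(1, 12, 16, 30, 35),
                 X2 = c(10, 18, 30, 40, 45))

res5 <- merge_close(df, 5)   # rows closer than 5 units are merged
res1 <- merge_close(df, 1)   # only overlapping or touching rows are merged
print(res5)
print(res1)
```

On the sample data this should give the two rows (1, 1, 40) and (2, 35, 45) for th = 5, and the three rows (1, 1, 10), (1, 12, 40) and (2, 35, 45) for th = 1. The split/lapply approach is chosen for readability; for hundreds of thousands of rows the data.table version is likely to be faster.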