I have a large data frame; a minimal working example of it is
df <- structure(list(
from = structure(
c(13858, 13859, 13860, 13861,
13864, 13865, 13866, 13867, 13868, 13871, 13871, 13872, 13873,
13874, 13875, 13878, 13878, 13879, 13880, 13881, 13882, 13885,
13886, 13887), class = "Date"),
to = structure(
c(13859, 13860,
13861, 13864, 13865, 13866, 13867, 13868, 13871, 13872, 13874,
13873, 13874, 13875, 13878, 13879, 13880, 13880, 13881, 13882,
13885, 13886, 13887, 13888), class = "Date"),
X1 = c(6, 5, 5, NA, NA, 4, 5, 4, 3, NA, NA, NA, NA, 6, 0, NA, NA, NA, 3,
5, 4, 5, 6, 10),
X2 = c(11, 5, 3, NA, 6, 10, 7, 3, 8, NA, 3, NA, NA, 7, 7, NA, 5, NA, 7,
4, 3, 2, 8, 8),
X3 = c(9, 3, 3, NA, 5, 7, 7, 6, 9, NA, 1, NA, NA, 6, 6, NA, 8, NA, 9, 2,
9, 4, 5, 9),
X4 = c(8, 5, 5, 4, 8, 8, 6, 5, 2, 4, NA, 10, 4, 4, 4, 5, NA, 4, 3, 3, 7,
3, 2, 1)),
.Names = c("from", "to", "X1", "X2", "X3", "X4"),
row.names = c(NA, -24L), class = "data.frame")
which looks like this:
from to X1 X2 X3 X4
1 2007-12-11 2007-12-12 6 11 9 8
2 2007-12-12 2007-12-13 5 5 3 5
3 2007-12-13 2007-12-14 5 3 3 5
4 2007-12-14 2007-12-17 NA NA NA 4
5 2007-12-17 2007-12-18 NA 6 5 8
6 2007-12-18 2007-12-19 4 10 7 8
7 2007-12-19 2007-12-20 5 7 7 6
8 2007-12-20 2007-12-21 4 3 6 5
9 2007-12-21 2007-12-24 3 8 9 2
10 2007-12-24 2007-12-25 NA NA NA 4
11 2007-12-24 2007-12-27 NA 3 1 NA
12 2007-12-25 2007-12-26 NA NA NA 10
13 2007-12-26 2007-12-27 NA NA NA 4
14 2007-12-27 2007-12-28 6 7 6 4
15 2007-12-28 2007-12-31 0 7 6 4
16 2007-12-31 2008-01-01 NA NA NA 5
17 2007-12-31 2008-01-02 NA 5 8 NA
18 2008-01-01 2008-01-02 NA NA NA 4
19 2008-01-02 2008-01-03 3 7 9 3
20 2008-01-03 2008-01-04 5 4 2 3
21 2008-01-04 2008-01-07 4 3 9 7
22 2008-01-07 2008-01-08 5 2 4 3
23 2008-01-08 2008-01-09 6 8 5 2
24 2008-01-09 2008-01-10 10 8 9 1
The data frame is uniquely indexed by the pair of columns (from, to), since
anyDuplicated(df[c('from','to')])==0 # TRUE
However, there is some duplication, in the sense that the (from, to) intervals do not uniquely partition the date range, i.e.
anyDuplicated(df['from'])>0 # TRUE
anyDuplicated(df['to'])>0 # TRUE
For example, the (from, to) interval from 2007-12-24 to 2007-12-27 also appears in the form of three sub-intervals: (2007-12-24, 2007-12-25), (2007-12-25, 2007-12-26) and (2007-12-26, 2007-12-27).
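To make this kind of duplication concrete, here is a rough sketch of how such spanning rows could be flagged. It runs on a made-up four-row data frame (toy) that mirrors the 2007-12-24 to 2007-12-27 case; the names toy, contained_in and spanning are only illustrative, and the check is quadratic in the number of rows, so it is meant for inspection rather than for large data.

```r
# Toy data (hypothetical): one wide interval plus the three sub-intervals
# that tile it, copied from the example above.
toy <- data.frame(
  from = as.Date(c("2007-12-24", "2007-12-24", "2007-12-25", "2007-12-26")),
  to   = as.Date(c("2007-12-25", "2007-12-27", "2007-12-26", "2007-12-27"))
)

# For each row i, collect the other rows whose interval lies inside row i's.
contained_in <- lapply(seq_len(nrow(toy)), function(i) {
  inside <- toy$from >= toy$from[i] & toy$to <= toy$to[i]
  setdiff(which(inside), i)
})

# Rows whose interval spans at least one other row.
spanning <- which(lengths(contained_in) > 0)
```

Here spanning picks out row 2 (2007-12-24 to 2007-12-27), and contained_in[[2]] lists the three sub-interval rows.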
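I would like to aggregate this data frame so that no two date intervals in (from, to) overlap each other. For rows that are duplicated in this sense, I would like to sum the values in each of the data columns X1, ..., X4. The resulting data frame should look like this: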
from to X1 X2 X3 X4
1 2007-12-11 2007-12-12 6 11 9 8
2 2007-12-12 2007-12-13 5 5 3 5
3 2007-12-13 2007-12-14 5 3 3 5
4 2007-12-14 2007-12-17 NA NA NA 4
5 2007-12-17 2007-12-18 NA 6 5 8
6 2007-12-18 2007-12-19 4 10 7 8
7 2007-12-19 2007-12-20 5 7 7 6
8 2007-12-20 2007-12-21 4 3 6 5
9 2007-12-21 2007-12-24 3 8 9 2
10 2007-12-24 2007-12-27 NA 3 1 18
11 2007-12-27 2007-12-28 6 7 6 4
12 2007-12-28 2007-12-31 0 7 6 4
13 2007-12-31 2008-01-02 NA 5 8 9
14 2008-01-02 2008-01-03 3 7 9 3
15 2008-01-03 2008-01-04 5 4 2 3
16 2008-01-04 2008-01-07 4 3 9 7
17 2008-01-07 2008-01-08 5 2 4 3
18 2008-01-08 2008-01-09 6 8 5 2
19 2008-01-09 2008-01-10 10 8 9 1
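For what it is worth, the aggregation I have in mind could be sketched roughly as follows. This is not a tested solution, only an illustration under some assumptions: intervals that merely touch at an endpoint do not count as overlapping, every chain of overlapping rows is collapsed into one row running from the earliest from to the latest to, and a sum is NA only when all contributing values are NA. The function names merge_overlaps and sum_or_na and the small data frame toy are made up for the example, and the input is assumed to have at least one row.

```r
# Hypothetical helper: collapse overlapping (from, to) rows and sum the values.
merge_overlaps <- function(d, value_cols) {
  d <- d[order(d$from, d$to), ]
  start <- as.numeric(d$from)
  end   <- as.numeric(d$to)
  # Latest end date seen among all earlier rows.
  prev_end <- c(-Inf, cummax(end)[-length(end)])
  # A row opens a new group when it starts at or after everything before it ends.
  group <- cumsum(start >= prev_end)
  # Sum ignoring NA, but keep NA when there is nothing to sum.
  sum_or_na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
  pieces <- lapply(split(d, group), function(g) {
    data.frame(from = min(g$from), to = max(g$to),
               lapply(g[value_cols], sum_or_na))
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Toy data: rows 9 to 14 of the example, restricted to X2 and X4.
toy <- data.frame(
  from = as.Date(c("2007-12-21", "2007-12-24", "2007-12-24",
                   "2007-12-25", "2007-12-26", "2007-12-27")),
  to   = as.Date(c("2007-12-24", "2007-12-25", "2007-12-27",
                   "2007-12-26", "2007-12-27", "2007-12-28")),
  X2 = c(8, NA, 3, NA, NA, 7),
  X4 = c(2, 4, NA, 10, 4, 4)
)
result <- merge_overlaps(toy, c("X2", "X4"))
```

On this toy input the four rows between 2007-12-24 and 2007-12-27 collapse into a single row with X2 = 3 and X4 = 18, which matches row 10 of the desired output above. Calling merge_overlaps(df, c("X1", "X2", "X3", "X4")) would be the corresponding call on the full example.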
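I have not come across a problem like this before, and I could not find a similar question on Stack Overflow. It seems to be a more complex kind of aggregation than what I am used to doing with aggregate(). Any solutions, code or references would be appreciated.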