How can I summarize consecutive depth data in R? For example:
a <- data.frame(label = as.factor(c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood")),
depth = as.numeric(c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14)))
The expected output is:
Label Depth
Air 7
Wood 3
Stone 1
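First, use cummax() to remove the decreases (including the negative value), since in this particular case depth can only increase. This gives: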
label depth
1 Air 1
2 Air 2
3 Air 3
4 Air 3
5 Air 4
6 Air 5
7 Wood 5
8 Wood 5
9 Wood 5
10 Wood 6
11 Wood 8
12 Air 9
13 Air 9
14 Air 9
15 Air 10
16 Stone 10
17 Stone 10
18 Stone 11
19 Stone 11
20 Air 11
21 Air 12
22 Air 12
23 Air 12
24 Air 13
25 Wood 14
26 Wood 14
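The cummax() step above can be reproduced with a short sketch. The data are the ones from the question, written compactly with rep(); the column name depth_cm is an illustrative choice and not part of the original post.

```r
# Example data from the question (labels written compactly with rep())
a <- data.frame(
  label = factor(c(rep("Air", 6), rep("Wood", 5), rep("Air", 4),
                   rep("Stone", 4), rep("Air", 5), rep("Wood", 2))),
  depth = c(1, 2, 3, -1, 4, 5, 4, 5, 4, 6, 8, 9, 8, 9, 10,
            9, 10, 11, 10, 11, 12, 10, 12, 13, 14, 14)
)

# Running maximum: each value is replaced by the largest depth seen so far,
# so the series never decreases (depth_cm is a hypothetical column name)
a$depth_cm <- cummax(a$depth)
head(a, 6)
```

Keeping the running maximum in a separate column leaves the raw depth values available for comparison.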
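Now, taking max - min within each run of consecutive rows with the same label gives the depth gain per run (the question is how to perform this step):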
label depth
1 Air 4
2 Wood 3
3 Air 1
4 Stone 1
5 Air 2
6 Wood 0
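Finally, summing those max - min values per label gives the output shown above.
Steps I tried in order to reach this output:
The first obvious solution, e.g. for Air: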
diff(cummax(a[a$label=="Air",]$depth))
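This solution removes the negative data, which is necessary because depth is expected to keep increasing. The problem is that the output also counts the large jumps between the consecutive subsets. As a result, the sum for Air would be 12 instead of 7: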
[1] 1 1 0 1 1 4 0 0 1 1 1 0 0 1
Even worse is a solution using aggregate, e.g.:
aggregate(depth~label, a, FUN=function(x){sum(x>0)})
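Note: a solution that filters out the large jumps is not what I'm looking for. Of course, you could hard-code a limit, e.g. < 2, again for the Air example: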
sum(diff(cummax(a[a$label=="Air",]$depth))[diff(cummax(a[a$label=="Air",]$depth))<2])
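This gives an almost correct result (8 instead of 7 for Air), but it does not work as intended. I'm pretty sure a function for what I'm looking for already exists, since this is not an uncommon problem across many different tasks.
I guess that taking the minimum and maximum of each run of consecutive rows for each material and summing the differences might be a possible solution, but I'm not sure how to apply a function only to consecutive subsets.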
Answer 0 (score: 1)
You can use data.table::rleid to group by run quickly, or rebuild it with rle if you prefer. After that, aggregating is fairly easy in any grammar. In dplyr,
library(dplyr)
a <- data.frame(label = c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood"),
depth = c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14))
a2 <- a %>%
  # filter to rows where previous value is lower, equal, or NA
  filter(depth >= lag(depth) | is.na(lag(depth))) %>%
  # group by label and its run
  group_by(label, run = data.table::rleid(label)) %>%
  summarise(depth = max(depth) - min(depth)) # aggregate
a2 %>% arrange(run) # sort to make it pretty
#> # A tibble: 6 x 3
#> # Groups: label [3]
#> label run depth
#> <fctr> <int> <dbl>
#> 1 Air 1 4
#> 2 Wood 2 3
#> 3 Air 3 1
#> 4 Stone 4 1
#> 5 Air 5 2
#> 6 Wood 6 0
a3 <- a2 %>% summarise(depth = sum(depth)) # a2 is still grouped, so aggregate more
a3
#> # A tibble: 3 x 2
#> label depth
#> <fctr> <dbl>
#> 1 Air 7
#> 2 Stone 1
#> 3 Wood 3
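For readers without data.table, the run identifier can be rebuilt from rle(), as mentioned at the start of this answer. The following is a minimal base R sketch under that assumption. It uses cummax() as in the question rather than the lag-based filter, and the names run, depth_cm, per_run and total are illustrative.

```r
# Example data from the question (labels written compactly with rep())
a <- data.frame(
  label = c(rep("Air", 6), rep("Wood", 5), rep("Air", 4),
            rep("Stone", 4), rep("Air", 5), rep("Wood", 2)),
  depth = c(1, 2, 3, -1, 4, 5, 4, 5, 4, 6, 8, 9, 8, 9, 10,
            9, 10, 11, 10, 11, 12, 10, 12, 13, 14, 14)
)

# Run-length encoding of the labels: one entry per run of identical labels
r <- rle(as.character(a$label))

# One integer id per run, repeated for every row in that run
run <- rep(seq_along(r$lengths), r$lengths)

# Running maximum of depth, as in the question
depth_cm <- cummax(a$depth)

# Depth gain within each run: max - min of the running maximum
per_run <- tapply(depth_cm, run, function(x) max(x) - min(x))

# Sum the per-run gains by label (r$values holds one label per run)
total <- tapply(as.vector(per_run), r$values, sum)
total
```

Because rle() already stores one label per run in r$values, no join back to the original rows is needed for the final sum.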
Answer 1 (score: 0)
A base R approach using aggregate is
aggregate(cbind(val=cummax(a$depth)),
          list(label=a$label, ID=c(0, cumsum(diff(as.integer(a$label)) != 0))),
          function(x) diff(range(x)))
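The first argument to aggregate computes the cumulative maximum of the input vector, as the OP does above, and uses cbind to give the computed column its name (val) in the final output. The second argument is the grouping argument. It uses a different method from rle: it takes the cumulative sum of an indicator that marks where the integer-coded label changes, so label must be a factor, as it is in the question's data. Finally, the third argument supplies the function that computes the desired output by taking the difference of the range within each group.
This returns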
label ID val
1 Air 0 4
2 Wood 1 3
3 Air 2 1
4 Stone 3 1
5 Air 4 2
6 Wood 5 0
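The result above still has one row per run. A possible last step, sketched here with the illustrative names ID, per_run and total, sums val by label to get the totals the question asks for.

```r
# Example data from the question; label must be a factor for as.integer()
a <- data.frame(
  label = factor(c(rep("Air", 6), rep("Wood", 5), rep("Air", 4),
                   rep("Stone", 4), rep("Air", 5), rep("Wood", 2))),
  depth = c(1, 2, 3, -1, 4, 5, 4, 5, 4, 6, 8, 9, 8, 9, 10,
            9, 10, 11, 10, 11, 12, 10, 12, 13, 14, 14)
)

# Run id: increases by one each time the integer-coded label changes
ID <- c(0, cumsum(diff(as.integer(a$label)) != 0))

# Per-run depth gain, as in the answer above
per_run <- aggregate(cbind(val = cummax(a$depth)),
                     list(label = a$label, ID = ID),
                     function(x) diff(range(x)))

# Sum the per-run gains by label
total <- aggregate(val ~ label, per_run, sum)
total
```

Computing ID as a separate vector first also makes it easy to inspect the grouping before aggregating.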
Answer 2 (score: 0)
The data.table way (partly borrowed from @alistaire):
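library(data.table)
setDT(a) # convert a to a data.table by reference
a[, depth := cummax(depth)] # running maximum removes the decreases
depth_gain <- a[,
  list(
    depth = max(depth) - depth[1], # Only need the starting and max values
    label = label[1]
  ),
  by = rleidv(label)
]
# sum the per-run gains by label
result <- depth_gain[, list(depth = sum(depth)), by = label]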