In R

Time: 2017-06-25 17:24:50

Tags: r depth microsoft-r

How can I summarize consecutive depth data in R? For example:

a <- data.frame(label = as.factor(c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood")), 
                depth = as.numeric(c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14)))

The expected output is:

Label Depth
Air    7
Wood   3
Stone  1

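First, use cummax() to remove the negative value (and any other decrease), since in this particular case depth can only increase. This gives: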

   label depth
1    Air     1
2    Air     2
3    Air     3
4    Air     3
5    Air     4
6    Air     5
7   Wood     5
8   Wood     5
9   Wood     5
10  Wood     6
11  Wood     8
12   Air     9
13   Air     9
14   Air     9
15   Air    10
16 Stone    10
17 Stone    10
18 Stone    11
19 Stone    11
20   Air    11
21   Air    12
22   Air    12
23   Air    12
24   Air    13
25  Wood    14
26  Wood    14
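
A minimal sketch of this step in base R. The data frame is rebuilt here (with rep() for brevity) so the snippet runs on its own, and a_mono is a hypothetical name used only for illustration:

```r
# Same data as above, written compactly with rep()
a <- data.frame(
  label = factor(c(rep("Air", 6), rep("Wood", 5), rep("Air", 4),
                   rep("Stone", 4), rep("Air", 5), rep("Wood", 2))),
  depth = c(1, 2, 3, -1, 4, 5, 4, 5, 4, 6, 8, 9, 8, 9, 10,
            9, 10, 11, 10, 11, 12, 10, 12, 13, 14, 14)
)

# Replace each depth by the running maximum so the series never decreases
a_mono <- transform(a, depth = cummax(depth))
head(a_mono)
```

The labels are left untouched; only the depth column is made monotone.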

Now, taking max - min over each run of consecutive rows gives the depth gained within that run (the question is how to perform this step):

   label depth
1   Air     4
2   Wood    3
3   Air     1
4   Stone   1
5   Air     2
6   Wood    0

Finally, summing these max - min values per label gives the output shown above.

What I have tried so far:

The first obvious solution, shown here for Air:

diff(cummax(a[a$label=="Air",]$depth))

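This removes the negative data, which is necessary because depth is expected to keep increasing. The problem is that the output also includes the large steps between consecutive subsets, so the sum for Air comes out as 12 instead of 7.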

 [1] 1 1 0 1 1 4 0 0 1 1 1 0 0 1

Even worse is a solution using aggregate, for example:

aggregate(depth~label, a, FUN=function(x){sum(x>0)})

Note: a solution that filters out the large jumps is not what I am after. You could of course hard-code a limit, e.g. < 2, again for the Air example:

sum(diff(cummax(a[a$label=="Air",]$depth))[diff(cummax(a[a$label=="Air",]$depth))<2])

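That gives an almost correct result (8 instead of 7), but it does not work as intended. I am fairly sure a function for this already exists, since it is not an uncommon problem across many different tasks.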

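I suppose taking the minimum and maximum of each run of consecutive rows for each material and summing the differences could be a solution, but I am not sure how to apply a function only to consecutive subsets.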

3 answers:

Answer 0: (score: 1)

You can use data.table::rleid to group by run quickly, or rebuild it with rle if you prefer. After that, the aggregation is fairly easy in whatever syntax you like. In dplyr,

library(dplyr)

a <- data.frame(label = c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood"), 
                depth = c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14))

a2 <- a %>% 
    # filter to rows where previous value is lower, equal, or NA
    filter(depth >= lag(depth) | is.na(lag(depth))) %>% 
    # group by label and its run
    group_by(label, run = data.table::rleid(label)) %>% 
    summarise(depth = max(depth) - min(depth))    # aggregate

a2 %>% arrange(run)    # sort to make it pretty
#> # A tibble: 6 x 3
#> # Groups:   label [3]
#>    label   run depth
#>   <fctr> <int> <dbl>
#> 1    Air     1     4
#> 2   Wood     2     3
#> 3    Air     3     1
#> 4  Stone     4     1
#> 5    Air     5     2
#> 6   Wood     6     0

a3 <- a2 %>% summarise(depth = sum(depth))    # a2 is still grouped, so aggregate more

a3
#> # A tibble: 3 x 2
#>    label depth
#>   <fctr> <dbl>
#> 1    Air     7
#> 2  Stone     1
#> 3   Wood     3
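
If you would rather avoid the data.table dependency, the run index can be rebuilt from rle, as mentioned above. A minimal sketch, where run_id is a hypothetical helper name and not part of any package:

```r
# Hypothetical helper: assigns 1, 2, 3, ... to successive runs of equal values,
# similar in spirit to data.table::rleid for a single vector
run_id <- function(x) {
  # rle() needs a plain atomic vector, so convert factors to character first
  r <- rle(as.character(x))
  rep(seq_along(r$lengths), times = r$lengths)
}

run_id(c("Air", "Air", "Wood", "Air", "Stone", "Stone"))
#> [1] 1 1 2 3 4 4
```

Under that assumption, group_by(label, run = run_id(label)) could stand in for the rleid call in the pipeline above.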

Answer 1: (score: 0)

A base R approach using aggregate is

aggregate(cbind(val=cummax(a$depth)),
          list(label=a$label, ID=c(0, cumsum(diff(as.integer(a$label)) != 0))),
          function(x) diff(range(x)))

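The first argument to aggregate computes the cumulative maximum, as the OP did above for the input vector, and uses cbind to give the computed vector a name in the final output. The second argument is the grouping argument. It takes a different approach from rle: it computes the cumulative sum of the changes in label, so the ID increases by one each time the label changes. Note that as.integer(a$label) relies on label being a factor, as it is in the question. Finally, the third argument supplies the function that computes the desired output by taking the difference of each group's range.

This returns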

  label ID val
1   Air  0   4
2  Wood  1   3
3   Air  2   1
4 Stone  3   1
5   Air  4   2
6  Wood  5   0
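
The table above is per run; to get the per-label totals the question asks for, one possible follow-up is a second aggregate call. This is only a sketch: per_run and totals are hypothetical names, and the data frame is rebuilt with rep() so the snippet runs on its own.

```r
# Same data as in the question, with label as a factor
a <- data.frame(
  label = factor(c(rep("Air", 6), rep("Wood", 5), rep("Air", 4),
                   rep("Stone", 4), rep("Air", 5), rep("Wood", 2))),
  depth = c(1, 2, 3, -1, 4, 5, 4, 5, 4, 6, 8, 9, 8, 9, 10,
            9, 10, 11, 10, 11, 12, 10, 12, 13, 14, 14)
)

# Depth gained within each run of identical labels (the call shown above)
per_run <- aggregate(cbind(val = cummax(a$depth)),
                     list(label = a$label,
                          ID = c(0, cumsum(diff(as.integer(a$label)) != 0))),
                     function(x) diff(range(x)))

# Sum the per-run gains for each label
totals <- aggregate(val ~ label, data = per_run, FUN = sum)
totals
```

With the sample data this should give 7 for Air, 1 for Stone and 3 for Wood, matching the expected output in the question.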

Answer 2: (score: 0)

The data.table way (partly borrowed from @alistaire):

library(data.table)

setDT(a)  # convert the data frame to a data.table by reference
a[, depth := cummax(depth)]
depth_gain <- a[,
  list(
    depth = max(depth) - depth[1],  # Only need the starting and max values
    label = label[1]
  ),
  by = rleidv(label)
]
result <- depth_gain[, list(depth = sum(depth)), by = label]