How to split data based on a range of values

Date: 2019-02-06 14:46:08

Tags: r

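I need to split my data into several groups based on a range of values. The values range from 20 to 73.

As the plot below shows, I need 3 distinct groups. Note also that once a value reaches the 70-73 range, the next value is around 40 and the series then falls to about 20, so the transition between groups is gradual.

I don't care about these transient values and intend to ignore them.

Sample data: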

df <- structure(list(V1 = c(27, 28, 34, 35, 47, 50, 52, 54, 55, 68, 
                            69, 73, 45, 39, 30, 21, 23, 24, 22, 26, 
                            29, 31, 32, 35, 42, 44, 46, 50, 55, 66, 
                            69, 70, 47, 40, 33, 21, 22, 29, 31, 38, 
                            47, 55, 59, 64, 66, 71)), 
                class = "data.frame", 
                row.names = c(NA, -46L))  # 46 values, so 46 rows

Code I tried:

df[, ID := cumsum(V1>=73)+1]

Expected output:

[Figure: plot of the data and the expected grouping]

2 answers:

Answer 0 (score: 1)

Maybe this will work for you:

library(dplyr)

df %>%
  group_by(groups = cumsum(coalesce(as.numeric(V1 < lag(V1) & lag(V1) >= 70), 1))) %>%
  filter(!coalesce(lead(cumsum(coalesce(as.numeric(V1 > lag(V1)), 1))), 99) == 1) %>%
  arrange(groups, V1)

Output:

   V1 groups
1  27      1
2  28      1
3  34      1
4  35      1
5  47      1
6  50      1
7  52      1
8  54      1
9  55      1
10 68      1
11 69      1
12 73      1
13 21      2
14 22      2
15 23      2
16 24      2
17 26      2
18 29      2
19 31      2
20 32      2
21 35      2
22 42      2
23 44      2
24 46      2
25 50      2
26 55      2
27 66      2
28 69      2
29 70      2
30 21      3
31 22      3
32 29      3
33 31      3
34 38      3
35 47      3
36 55      3
37 59      3
38 64      3
39 66      3
40 71      3

Data:

df <- structure(list(V1 = c(27, 28, 34, 35, 47, 50, 52, 54, 55, 68, 
69, 73, 45, 39, 30, 21, 23, 24, 22, 26, 29, 31, 32, 35, 42, 44, 
46, 50, 55, 66, 69, 70, 47, 40, 33, 21, 22, 29, 31, 38, 47, 55, 
59, 64, 66, 71)), class = "data.frame", row.names = c(NA, -46L
))
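
To make the two nested `coalesce()` expressions in the pipeline above easier to follow, here is a sketch that computes the same conditions as named intermediate columns. The column names `new_group`, `rises` and `transient` are made up for illustration and are not part of the answer; the sketch skips the final `arrange()`, so rows stay in their original order.

```r
library(dplyr)

# Sample data from the question (46 values).
df <- data.frame(V1 = c(27, 28, 34, 35, 47, 50, 52, 54, 55, 68,
                        69, 73, 45, 39, 30, 21, 23, 24, 22, 26,
                        29, 31, 32, 35, 42, 44, 46, 50, 55, 66,
                        69, 70, 47, 40, 33, 21, 22, 29, 31, 38,
                        47, 55, 59, 64, 66, 71))

steps <- df %>%
  mutate(
    # TRUE where the value drops right after a peak of 70 or more.
    # The first row has no predecessor, so coalesce() turns its NA into TRUE.
    new_group = coalesce(V1 < lag(V1) & lag(V1) >= 70, TRUE),
    groups    = cumsum(new_group)
  ) %>%
  group_by(groups) %>%
  mutate(
    # Running count of rises within the group; it stays at 1 for as long
    # as the series is still falling from the previous peak.
    rises     = cumsum(coalesce(V1 > lag(V1), TRUE)),
    # A row is transient if the row after it still belongs to that initial
    # fall. The last row of a group has no successor, hence the 99 filler.
    transient = coalesce(lead(rises), 99L) == 1
  ) %>%
  ungroup()

# Dropping the transient rows leaves the 40 rows shown in the output above.
kept <- filter(steps, !transient)
```

Printing `steps` shows that the only rows flagged as transient are 45, 39, 30 (start of group 2) and 47, 40, 33 (start of group 3).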

Answer 1 (score: 1)

Here is another option with dplyr:

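library(dplyr)
library(tidyr)  # for replace_na()

df2 <- df %>% 
  # flag the three largest values (73, 71, 70) as group-ending peaks
  mutate(high_val = if_else(V1 %in% tail(sort(V1), 3), 1, 0)) %>%
  # a group starts on the row after each peak
  mutate(cs_val   = 1 + lag(cumsum(high_val))) %>%
  replace_na(list(cs_val = 1)) %>% 
  group_by(cs_val) %>%
  mutate(counter  =  row_number(cs_val)) %>%
  # keep only rows from the group minimum onwards
  mutate(min_val  =  if_else(V1 == min(V1), 1, 0)) %>%
  mutate(cs_count =  cumsum(min_val)) %>% 
  filter(cs_count != 0) %>% 
  select(V1, groups = cs_val)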

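I'm not sure this is any less complicated than the accepted answer. Basically, I create a bunch of columns to track the maximum and minimum values within each group, and then filter out the transient values.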

Result:

# A tibble: 40 x 2
      V1 groups
   <dbl>  <dbl>
 1    27      1
 2    28      1
 3    34      1
 4    35      1
 5    47      1
 6    50      1
 7    52      1
 8    54      1
 9    55      1
10    68      1
11    69      1
12    73      1
13    21      2
14    23      2
15    24      2
16    22      2
17    26      2
18    29      2
19    31      2
20    32      2
21    35      2
22    42      2
23    44      2
24    46      2
25    50      2
26    55      2
27    66      2
28    69      2
29    70      2
30    21      3
31    22      3
32    29      3
33    31      3
34    38      3
35    47      3
36    55      3
37    59      3
38    64      3
39    66      3
40    71      3
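
For comparison, the same bookkeeping can be sketched in base R without any packages. This is not the code from either answer. It borrows the "drop after a value of 70 or more" test from the first answer to start a group, and the "discard everything before the group minimum" idea described here to remove the transient values. The threshold `peak` and the helper names `prev`, `starts`, `pos` and `min_pos` are assumptions for illustration. It also assumes the lowest value of each group sits at the end of the descent, which holds for the sample data.

```r
# Sample data from the question (46 values).
V1 <- c(27, 28, 34, 35, 47, 50, 52, 54, 55, 68,
        69, 73, 45, 39, 30, 21, 23, 24, 22, 26,
        29, 31, 32, 35, 42, 44, 46, 50, 55, 66,
        69, 70, 47, 40, 33, 21, 22, 29, 31, 38,
        47, 55, 59, 64, 66, 71)

# Assumption: a value of 70 or more marks the end of a cycle
# (the question speaks of the "70-73" range).
peak <- 70

# A new group starts on a row that drops below a preceding peak.
prev   <- c(NA, head(V1, -1))
starts <- !is.na(prev) & prev >= peak & V1 < prev
groups <- cumsum(starts) + 1

# Position of each row within its group, and position of the group minimum.
pos     <- ave(V1, groups, FUN = seq_along)
min_pos <- ave(V1, groups, FUN = which.min)

# Rows before the group minimum are the transient descent; drop them.
keep   <- pos >= min_pos
result <- data.frame(V1 = V1[keep], groups = groups[keep])
```

`result` has 40 rows in the original order, split 12 / 17 / 11 across the three groups, the same as the result table above.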