Group by consecutive occurrences of a value

Time: 2019-03-14 17:57:32

Tags: r dplyr

I have run into a problem that forces me to use a loop instead of my preferred dplyr pipe flow.

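I want to group rows based on consecutive observations of the same value. For example, if the first four observations of type equal a, those first four observations should be assigned to the same group. Order matters, so I can't simply use dplyr::group_by and dplyr::summarize.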
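To make the difficulty concrete, here is a small base R sketch (the variable names by_value and n_runs are illustrative only). It shows that grouping on the value alone merges non-adjacent runs, whereas the goal is one group per unbroken run.

```r
# Same layout as the test data: a x5, b x5, a x5, b x5
type <- c(rep("a", 5), rep("b", 5), rep("a", 5), rep("b", 5))

# Grouping on the value alone: rows 1-5 and 11-15 end up together
by_value <- split(seq_along(type), type)
length(by_value)   # 2 groups

# What is wanted instead: one group per unbroken run of equal values
n_runs <- sum(type[-1] != type[-length(type)]) + 1
n_runs             # 4 runs
```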

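The code below should illustrate the problem well. I would like to know whether anyone can suggest a simpler way to do this, preferably with tidyverse packages rather than data.table.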

library(tidyverse)

# Create some test data
df <- tibble(
  id = 1:20,
  type = c(rep("a", 5), rep("b", 5), rep("a", 5), rep("b", 5)),
  val = runif(20)
)

df
#> # A tibble: 20 x 3
#>       id type     val
#>    <int> <chr>  <dbl>
#>  1     1 a     0.0606
#>  2     2 a     0.501 
#>  3     3 a     0.974 
#>  4     4 a     0.0833
#>  5     5 a     0.752 
#>  6     6 b     0.0450
#>  7     7 b     0.367 
#>  8     8 b     0.649 
#>  9     9 b     0.846 
#> 10    10 b     0.896 
#> 11    11 a     0.178 
#> 12    12 a     0.295 
#> 13    13 a     0.206 
#> 14    14 a     0.233 
#> 15    15 a     0.851 
#> 16    16 b     0.179 
#> 17    17 b     0.801 
#> 18    18 b     0.326 
#> 19    19 b     0.269 
#> 20    20 b     0.584

# Solve problem with a loop
count <- 1
df$consec_group <- NA
for (i in 1:nrow(df)) {
  current <- df$type[i]
  lag <- ifelse(i == 1, NA, df$type[i - 1])
  lead <- ifelse(i == nrow(df), NA, df$type[i + 1])

  if (lead %>% is.na) {
    df$consec_group[i] <- ifelse(current == lag, count, count + 1) 
  } else {
    df$consec_group[i] <- count 
    if (current != lead) count <- count + 1
  }
}

df
#> # A tibble: 20 x 4
#>       id type     val consec_group
#>    <int> <chr>  <dbl>        <dbl>
#>  1     1 a     0.0606            1
#>  2     2 a     0.501             1
#>  3     3 a     0.974             1
#>  4     4 a     0.0833            1
#>  5     5 a     0.752             1
#>  6     6 b     0.0450            2
#>  7     7 b     0.367             2
#>  8     8 b     0.649             2
#>  9     9 b     0.846             2
#> 10    10 b     0.896             2
#> 11    11 a     0.178             3
#> 12    12 a     0.295             3
#> 13    13 a     0.206             3
#> 14    14 a     0.233             3
#> 15    15 a     0.851             3
#> 16    16 b     0.179             4
#> 17    17 b     0.801             4
#> 18    18 b     0.326             4
#> 19    19 b     0.269             4
#> 20    20 b     0.584             4

Created on 2019-03-14 by the reprex package (v0.2.1)

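This grouping of consecutive type occurrences is really just an intermediate step. My endgame is to manipulate val for a given consec_group, based on the values of val that occurred within the previous consec_group. Suggestions for relevant packages would be appreciated.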
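One possible reading of that end goal, sketched with dplyr. The actual manipulation is not specified in the question, so subtracting the previous run's mean is purely an assumption, as are the names prev_stats, prev_mean and val_adj. The seed is added only to make the sketch reproducible.

```r
library(dplyr)

set.seed(1)  # assumption: not in the original, only for reproducibility
df <- tibble(
  id = 1:20,
  type = rep(rep(c("a", "b"), 2), each = 5),
  val = runif(20),
  consec_group = rep(1:4, each = 5)  # the run ids produced by the loop above
)

# One row per run, carrying the mean of the run that came before it
prev_stats <- df %>%
  group_by(consec_group) %>%
  summarise(group_mean = mean(val)) %>%
  mutate(prev_mean = lag(group_mean)) %>%
  select(consec_group, prev_mean)

# Hypothetical manipulation: shift val by the previous run's mean
result <- df %>%
  left_join(prev_stats, by = "consec_group") %>%
  mutate(val_adj = val - prev_mean)
```

The first run has no predecessor, so prev_mean and val_adj are NA for its rows. Any other per-run summary (last value, maximum, and so on) could replace mean() in the same pattern.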

2 answers:

Answer 0: (score: 4)

You say "no data.tables", but are you sure? It's so *** fast and easy (in this case)...

library(data.table)
setDT(df)[, groupid := rleid(type)][]

#     id type         val groupid
#  1:  1    a 0.624078793       1
#  2:  2    a 0.687361541       1
#  3:  3    a 0.817702740       1
#  4:  4    a 0.669857208       1
#  5:  5    a 0.100977936       1
#  6:  6    b 0.418275823       2
#  7:  7    b 0.660119857       2
#  8:  8    b 0.876015209       2
#  9:  9    b 0.473562143       2
# 10: 10    b 0.284474633       2
# 11: 11    a 0.034154862       3
# 12: 12    a 0.391760387       3
# 13: 13    a 0.383107868       3
# 14: 14    a 0.729583433       3
# 15: 15    a 0.006288375       3
# 16: 16    b 0.530179235       4
# 17: 17    b 0.802643704       4
# 18: 18    b 0.409618633       4
# 19: 19    b 0.309363642       4
# 20: 20    b 0.021918512       4

If you insist on using tidyverse / dplyr, you can (of course) still use the rleid function, as follows:

df %>% mutate( groupid = data.table::rleid(type) )
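If pulling in data.table at all is undesirable, the same ids can be sketched with dplyr alone. This is an alternative idiom rather than what the answer above uses: a new run starts wherever type differs from the previous row, so a cumulative sum of those change points numbers the runs. Newer dplyr releases (1.1.0 and later) also provide consecutive_id() for this purpose; the version check below guards that call. The name df2 is illustrative.

```r
library(dplyr)

df <- tibble(
  id = 1:20,
  type = rep(rep(c("a", "b"), 2), each = 5)
)

df2 <- df %>%
  mutate(
    # TRUE at each row where a new run begins; cumsum turns that into run ids
    groupid = cumsum(type != lag(type, default = first(type))) + 1L
  )

# Only available in dplyr >= 1.1.0, hence the guard
if (packageVersion("dplyr") >= "1.1.0") {
  df2 <- df2 %>% mutate(groupid2 = consecutive_id(type))
}
```

Note that the cumsum() variant assumes type has no missing values; an NA would propagate into every later id.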

Benchmarks

On a larger sample

library(tidyverse)
library(data.table)

# Create some large test data
df <- tibble(
  id = 1:200000,
  type = sample(letters[1:26], 200000, replace = TRUE),
  val = runif(200000)
)

dt <- as.data.table(df)

microbenchmark::microbenchmark(
  dplyr.rleid      = df %>% mutate( groupid = data.table::rleid(type) ),
  data.table.rleid = dt[, groupid := rleid(type)][], 
  rle = df %>% mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)}),
  rle2 = df %>% mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths))),
  transform = transform(df, ID = with(rle(df$type), rep(seq_along(lengths), lengths))),
  times = 10)

# Unit: milliseconds
#             expr       min        lq      mean    median        uq        max neval
#      dplyr.rleid  3.153626  3.278049  3.410363  3.444949  3.502792   3.582626    10
# data.table.rleid  2.965639  3.065959  3.173992  3.145643  3.259672   3.507009    10
#              rle 13.059774 14.042797 24.364176 26.126176 29.460561  36.874054    10
#             rle2 12.641319 13.553846 30.951152 24.698338 34.139786 102.791719    10
#        transform 12.330717 22.419128 22.725242 25.532084 26.187634  26.702794    10
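Timings only mean something if the candidates return the same ids. A minimal base R check, assuming random data of the same shape as the benchmark input (the seed and the names id_rle and id_cumsum are illustrative), could look like this:

```r
set.seed(42)  # assumption: seed chosen only for reproducibility
type <- sample(letters[1:26], 2000, replace = TRUE)

# rle-based ids, as in the rle2 and transform benchmark entries
id_rle <- with(rle(type), rep(seq_along(lengths), lengths))

# change-point ids: count positions where the value differs from its predecessor
id_cumsum <- cumsum(c(TRUE, type[-1] != type[-length(type)]))

# Both are integer vectors and should match element for element
stopifnot(identical(id_rle, id_cumsum))
```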

Answer 1: (score: 3)

You can use an rleid()-like possibility built on base R's rle():

df %>%
 mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)})

Or a modification that makes it more convenient (originally proposed by @d.b):

df %>%
 mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths)))

      id type     val ID_rleid
   <int> <chr>  <dbl>    <int>
 1     1 a     0.0430        1
 2     2 a     0.858         1
 3     3 a     0.504         1
 4     4 a     0.318         1
 5     5 a     0.469         1
 6     6 b     0.144         2
 7     7 b     0.173         2
 8     8 b     0.0706        2
 9     9 b     0.958         2
10    10 b     0.557         2
11    11 a     0.358         3
12    12 a     0.973         3
13    13 a     0.982         3
14    14 a     0.177         3
15    15 a     0.599         3
16    16 b     0.627         4
17    17 b     0.454         4
18    18 b     0.682         4
19    19 b     0.690         4
20    20 b     0.713         4
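For readers unfamiliar with rle(), here is a small sketch of why this works. rle() returns the length and value of every unbroken run, and repeating each run's index by its length yields the group id. The wrapper name run_id() is hypothetical and not part of any package.

```r
# run_id() is a hypothetical helper name, used here only for illustration
run_id <- function(x) {
  runs <- rle(x)  # lengths and values of each unbroken run
  rep(seq_along(runs$lengths), runs$lengths)
}

x <- c("a", "a", "b", "a")
rle(x)$lengths   # 2 1 1
rle(x)$values    # "a" "b" "a"
run_id(x)        # 1 1 2 3
```

With such a helper the pipeline above could be written as df %>% mutate(ID_rleid = run_id(type)).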