Count the number of runs of a specific value per group

Time: 2020-06-25 13:04:42

Tags: r count sequence

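Say I have a data frame with an ID and a variable whose responses are ON or OFF. I want to count the number of runs of "ON" per group. I have almost got there, but realized that my solution cannot handle the first or last value in a group, depending on whether lead or lag is used.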

I have searched SO and found similar questions, but none that quite match this one.

id <- c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b","c", "c","c","c","c","c","c","c" )
category <- c("ON", "OFF", "OFF", "ON", "ON", "ON", "OFF", "OFF", "ON", "ON", "OFF", "OFF","OFF","OFF","OFF", "ON", "ON","ON")
dat<-data.frame(id, category)

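My attempt so far does not work, I think because it breaks when a run of "ON" sits at the edge of a group (with lead(), a run that reaches the group's last row is dropped):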

library(dplyr)
summary(dat %>% group_by(id) %>% filter(category == "ON", lead(category != "ON")) %>% count(category) %>% arrange(n))
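
The edge case can be seen directly: within each group, lead() returns NA on the last row, and filter() drops rows where the condition is NA, so a run of "ON" that reaches the end of a group is never counted. A minimal sketch of one way around this, assuming it is acceptable to treat the position past the end of a group as "OFF" (the column name n_on_runs is made up for illustration):

```r
library(dplyr)

id <- c(rep("a", 5), rep("b", 5), rep("c", 8))
category <- c("ON", "OFF", "OFF", "ON", "ON",
              "ON", "OFF", "OFF", "ON", "ON",
              rep("OFF", 5), rep("ON", 3))
dat <- data.frame(id, category, stringsAsFactors = FALSE)

# A run of "ON" ends on a row that is "ON" and whose next row in the
# same group is not "ON". default = "OFF" stands in for the missing
# "next row" after the last row of each group (an assumption), so a
# trailing run counts as ended instead of yielding NA.
on_runs <- dat %>%
  group_by(id) %>%
  summarise(n_on_runs = sum(category == "ON" &
                              lead(category, default = "OFF") != "ON"))

on_runs
```

Counting run ends with lead() or run starts with lag(category, default = "OFF") gives the same number per group, since every run has exactly one start and one end.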

Any help is much appreciated. My actual dataset has 40,000 rows with 120 IDs, and within each ID the category may start with either ON or OFF.

The output would look like this:

# id    category       n    
# a:1   OFF:0    Min.   :1  
# b:1   ON :2    1st Qu.:1  
# c:0            Median :1  
#                Mean   :1  
#                3rd Qu.:1  
#                Max.   :1 

So the interpretation would be that 2 ids had a run of "ON" at some point, and that the median number of ON runs (in this small sample) is 1.

2 Answers:

Answer 0: (score: 1)

library(tidyverse)  # dplyr, tidyr (nest/unnest), purrr (map), tibble, forcats (as_factor)

# step 1
out <- dat %>%
  group_by(id) %>%
  nest()

# outcome step 1
out
# # A tibble: 3 x 2
# # Groups:   id [3]
#   id    data            
#   <chr> <list>          
# 1 a     <tibble [5 x 1]>
# 2 b     <tibble [5 x 1]>
# 3 c     <tibble [8 x 1]>

# step 2
out <- out %>%
  mutate(run = map(data, ~ {
    # as.character() guards against category being a factor, which rle() rejects
    out_map <- rle(as.character(.x$category))
    out_map <- tibble(length = out_map$lengths, category = out_map$values)
    return(out_map)
  })) %>%
  select(-data)

# outcome step 2
out
# # A tibble: 3 x 2
# # Groups:   id [3]
#   id    run             
#   <chr> <list>          
# 1 a     <tibble [3 x 2]>
# 2 b     <tibble [3 x 2]>
# 3 c     <tibble [2 x 2]>

# step 3
out <- out %>%
  unnest(cols = c(run)) %>%
  # keep only "ON" runs; length > 1 also drops runs of a single row
  # (remove that condition to count every ON run)
  filter(category == "ON", length > 1) %>%
  ungroup() %>%
  mutate_if(is.character, as_factor)
    
out
# # A tibble: 3 x 3
#   id    length category
#   <fct>  <int> <fct>   
# 1 a          2 ON      
# 2 b          2 ON      
# 3 c          3 ON      

count(out, id, category, sort = TRUE)
# # A tibble: 3 x 3
#   id    category     n
#   <fct> <fct>    <int>
# 1 a     ON           1
# 2 b     ON           1
# 3 c     ON           1

summary(out)
#  id        length      category
#  a:1   Min.   :2.000   ON:3    
#  b:1   1st Qu.:2.000           
#  c:1   Median :2.000           
#        Mean   :2.333           
#        3rd Qu.:2.500           
#        Max.   :3.000 

Answer 1: (score: 0)

In base R we can use

tapply(dat$category, dat$id, function(x) with(rle(as.character(x)),sum(values == "ON")))

a b c 
2 2 1
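
To get from these per-id counts to the kind of overview asked for in the question, the result of tapply() can be summarised directly. A minimal sketch, assuming the sample data from the question (the names runs and n_ids_with_on are made up for illustration):

```r
id <- c(rep("a", 5), rep("b", 5), rep("c", 8))
category <- c("ON", "OFF", "OFF", "ON", "ON",
              "ON", "OFF", "OFF", "ON", "ON",
              rep("OFF", 5), rep("ON", 3))
dat <- data.frame(id, category, stringsAsFactors = FALSE)

# Number of "ON" runs per id: rle() collapses consecutive repeats,
# so each "ON" entry in values is one run.
runs <- tapply(dat$category, dat$id,
               function(x) with(rle(as.character(x)), sum(values == "ON")))

n_ids_with_on <- sum(runs > 0)  # ids with at least one ON run: 3 here
median(runs)                    # median ON runs per id: 2 here
summary(as.vector(runs))        # min, quartiles, mean, max of the counts
```

Because rle() works on whole runs, ids that start or end with "ON" need no special handling, and ids with no "ON" at all appear with a count of 0 rather than being dropped.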