我有一个如下所示的二进制序列:
set.seed(1)
n <- 1000
x <- sample(c(0,1), n, rep = TRUE)
如何找到连续2个,连续3个等的次数?例如,我可以使用
找到至少连续2个的次数length(which((x[-1] == 1) & (diff(x) == 0)))
答案 0 :(得分:4)
我们可以使用run-length-encoding(rle
)
with(rle(x), sum(values == 1 & lengths == 2))
即。
fn_len <- function(vec, val, n) {
with(rle(vec), sum(values == val & lengths == n))
}
fn_len(x, 1, 2)
#[1] 63
fn_len(x, 1, 3)
#[1] 34
如果我们需要获得多个元素的长度
sapply(2:5, fn_len, vec = x, val = 1)
#[1] 63 34 19 7
或其他选项rleid
来自data.table
library(data.table)
data.table(x)[, .N, .(x, rleid(x))][x==1, sum(N==2)]
#[1] 63
set.seed(1)
n <- 1e7
x <- sample(c(0, 1), n, replace = TRUE)
system.time(out1 <- table(scan(text=gsub("0+",";",paste0(x,collapse="")),
sep=";",quiet = T))[2])
# user system elapsed
# 11.818 0.152 11.976
system.time(out2 <- table(strsplit(gsub("0+",";",paste0(x,collapse="")),
";")[[1]])[3])
# user system elapsed
#10.708 0.200 10.913
system.time(fn_len(x, 1, 2))
# user system elapsed
# 0.671 0.399 1.073
如果我们想同时拥有多个'n',data.table
方法会更快
system.time(data.table(x)[, .N, .(x, rleid(x))][x==1, .N, N])
# user system elapsed
# 2.246 0.285 2.561
system.time(sapply(2:21, fn_len, vec = x, val = 1))
# user system elapsed
# 14.171 6.103 20.323
system.time(table(strsplit(gsub("0+",";",paste0(x,collapse="")),";")[[1]]))
# user system elapsed
# 10.570 0.192 10.770
答案 1 :(得分:2)
其他Base R方法
table(scan(text=gsub("0+",";",paste0(x,collapse="")),sep=";",quiet = T))
1 11 111 1111 11111 111111 111111111
114 63 34 19 7 3 1
甚至:
table(strsplit(gsub("0+",";",paste0(x,collapse="")),";")[[1]])
答案 2 :(得分:1)
您还可以使用rle
设置一个可以用不同方式查询的数据框,例如:
library(dplyr)
rle_x = rle(x)
results = data.frame(
x = x,
run_length = rep(rle_x$lengths, times = rle_x$lengths),
group = rep(1:length(rle_x$lengths), times = rle_x$lengths)
)
# Output:
# x run_length group
# 1 0 2 1
# 2 0 2 1
# 3 1 2 2
# 4 1 2 2
# 5 0 1 3
# Find runs of 1 with length exactly == 2
results %>%
filter(x == 1, run_length == 2) %>%
summarize(groups = n_distinct(group))
# Output:
# groups
# 1 63
# Runs of '1' of at least length 2:
results %>%
filter(x == 1, run_length >= 2) %>%
summarize(groups = n_distinct(group))
# groups
# 1 127