I have a set of values, each representing a zone.
For example:
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)
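I need to determine whether the data contain a repeated subsequence, i.e. whether the subject repeatedly travels from zone 1 to 2 to 3. In the example above, 1,2,3 occurs 3 times. I do not know these subsequences in advance, so I need R to find them from the data.
After that, I need to count how many times each such subsequence appears in the data.
Please forgive my ignorance if this is a simple task; my knowledge of R is very basic!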
Answer 0 (score: 4)
Here is a way to find repeated sequences of length n and how many times each one repeats.
For n = 3:
library(tidyverse) # not necessary, see base version below
n <- 3
lapply(seq(0, length(x) - n), `+`, seq(n)) %>% # get index of all subsequences
map_chr(~ paste(x[.], collapse = ',')) %>% # paste together as character
table %>% # get number of times each occurs
`[`(. > 1) # select sequences occurring > 1 time
# 1,2,3
# 3
For n = 2:
n <- 2
lapply(seq(0, length(x) - n), `+`, seq(n)) %>%
map_chr(~ paste(x[.], collapse = ',')) %>%
table %>%
`[`(. > 1)
# 1,2 2,3 5,9
# 3 3 2
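The first step of the pipeline may be easier to follow on a tiny input. The sketch below is only an illustration: the vector v is made up, and it shows that adding seq(n) to each offset 0, 1, 2, ... yields one index vector per window.

```r
v <- c(10, 20, 30, 40)   # made-up example vector
n <- 2

# Offsets 0..(length(v) - n), each shifted by 1..n, give the window indices
idx <- lapply(seq(0, length(v) - n), `+`, seq(n))

# Each element of idx selects one contiguous window of v
windows <- lapply(idx, function(i) v[i])
print(windows)
```

Here idx holds the index pairs (1,2), (2,3) and (3,4), so windows holds the three overlapping pairs of v.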
Without the tidyverse:
seqs <- lapply(seq(0, length(x) - n), `+`, seq(n))
seqs.char <- sapply(seqs, function(i) paste(x[i], collapse = ','))
tbl <- table(seqs.char)
tbl[tbl > 1]
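The same counting could also be wrapped in a small reusable function. The sketch below is one possible variant rather than part of the original answer: it uses base R's embed() to build the windows, and the name count_subseq and its min_count argument are hypothetical, chosen only for illustration.

```r
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)

# Hypothetical helper: count every contiguous window of length n in x
# and keep those seen at least min_count times.
count_subseq <- function(x, n, min_count = 2) {
  # embed() returns one window per row with the newest value first,
  # so reverse the columns to restore the original order
  windows <- embed(x, n)[, n:1, drop = FALSE]
  keys <- apply(windows, 1, paste, collapse = ",")
  tbl <- table(keys)
  tbl[tbl >= min_count]
}

count_subseq(x, 3)   # 1,2,3 occurs 3 times
count_subseq(x, 2)   # 1,2 and 2,3 occur 3 times, 5,9 occurs twice
```

Like the versions above, this still pastes each window into a string before counting.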
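I'll add a question of my own: does anyone know how to do this without first converting to character? For example, a function fun such that fun(list(1:2, 1:2, 2:3)) tells you that 1:2 occurs twice and 2:3 occurs once?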
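One possible way to avoid strings, offered only as a sketch under stated assumptions: if all sequences have the same length and contain non-negative integers, each sequence can be encoded as a single number in a radix larger than any value, and the numbers can be counted instead. The helper name count_seqs is hypothetical, and the encoding is only exact while the keys stay below 2^53.

```r
# Hypothetical helper: counts equal-length sequences of non-negative
# integers without pasting them into strings.
count_seqs <- function(seqs) {
  m <- do.call(rbind, seqs)              # one sequence per row
  base <- max(m) + 1                     # radix larger than any value
  weights <- base^((ncol(m) - 1):0)      # positional weights, most significant first
  keys <- as.vector(m %*% weights)       # one numeric key per sequence
  first <- !duplicated(keys)             # first occurrence of each distinct key
  counts <- tabulate(match(keys, keys[first]))
  data.frame(m[first, , drop = FALSE], count = counts)
}

res <- count_seqs(list(1:2, 1:2, 2:3))
print(res)   # row (1, 2) has count 2, row (2, 3) has count 1
```

The distinct sequences come back as the unnamed matrix columns X1 and X2, in order of first appearance.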
Answer 1 (score: 0)
Another tidyverse approach builds one large data frame of results covering each subsequence length you want to examine:
library(tidyverse)
# example vector
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)
# function that takes the number of consecutive elements in a subsequence
# and returns a data frame of subsequences ordered by count of occurrences
f = function(n) {
data.frame(value = x) %>% # get the vector x
slice(1:(nrow(.)-n+1)) %>% # remove values not needed from the end
mutate(position = row_number()) %>% # add position of each value
rowwise() %>% # for each value/row
mutate(vec = paste0(x[position:(position+n-1)], collapse = ",")) %>% # create subsequences as a string
ungroup() %>% # forget the grouping
count(vec, sort = TRUE) } # order by counts descending
2:5 %>% # specify how many values in your subsequences you want to investigate (let's say from 2 to 5)
map_df(~ data.frame(NumElements = ., f(.))) %>% # apply your function and keep the number values
arrange(desc(n)) %>% # order by counts descending
as_tibble() # (only for visualisation purposes; replaces the deprecated tbl_df())
# # A tibble: 88 x 3
# NumElements vec n
# <dbl> <chr> <int>
# 1 2 1,2 3
# 2 2 2,3 3
# 3 3 1,2,3 3
# 4 2 5,9 2
# 5 2 1,6 1
# 6 2 10,1 1
# 7 2 10,7 1
# 8 2 3,10 1
# 9 2 3,4 1
# 10 2 4,1 1
# # ... with 78 more rows
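If only the repeated subsequences are of interest, the sweep over lengths can be filtered as it runs. The following base R sketch is an assumption-laden variant, not the answer's own code: the helper name repeated_windows is hypothetical, and the column names simply mirror the ones used above.

```r
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)

# Hypothetical helper: windows of length n that occur more than once,
# returned as a data frame with the same columns as the result above.
repeated_windows <- function(x, n) {
  starts <- seq_len(length(x) - n + 1)
  keys <- vapply(starts,
                 function(i) paste(x[i:(i + n - 1)], collapse = ","),
                 character(1))
  counts <- table(keys)
  keep <- as.vector(counts) > 1
  data.frame(NumElements = rep(n, sum(keep)),
             vec = names(counts)[keep],
             n = as.vector(counts)[keep],
             stringsAsFactors = FALSE)
}

# Sweep lengths 2 to 5 and order by count, descending
res <- do.call(rbind, lapply(2:5, repeated_windows, x = x))
res <- res[order(-res$n), ]
print(res)
```

For this x only four rows remain: 1,2 and 2,3 and 1,2,3 with count 3, and 5,9 with count 2.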
Answer 2 (score: 0)
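The following approach finds sequences of any length (k): it reshapes the input vector into a matrix with k columns, filled row by row, and does this k times, prepending 0:(k-1) NAs to the start of the vector so that every offset is covered. Finally, it counts the rows across these k matrices (paste joins the elements of each row into a string):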
frs <- function(x, k=2){
padit <- function(.) c(.,rep(NA, k-length(.)%%k))
xx <- lapply(1:k, function(iii) padit(c(rep(NA,iii-1), x)))
xx <- do.call(rbind, lapply(xx, function(.) matrix(., ncol=k, byrow=TRUE)))
xx <- sapply(split(xx, 1:NROW(xx)), paste, collapse=",")
(function(x) x[x>1])(table(xx))
}
Output:
> frs(x,2)
xx
1,2 2,3 5,9
3 3 2
> frs(x,3)
1,2,3
3
> frs(x,4)
named integer(0)
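To see why the k shifted matrices together cover every window, here is a small illustration on a made-up vector v. It is a sketch rather than a copy of frs: the helper pad_to_multiple is hypothetical and, unlike padit, adds no padding when the length is already a multiple of k.

```r
v <- 1:5   # made-up example vector
k <- 2

# Hypothetical helper: pad with trailing NAs up to a multiple of k
pad_to_multiple <- function(v, k) c(v, rep(NA, (-length(v)) %% k))

# One matrix per offset 0..(k-1); each row is a candidate window
shifted <- lapply(0:(k - 1), function(offset) {
  matrix(pad_to_multiple(c(rep(NA, offset), v), k), ncol = k, byrow = TRUE)
})

# Stack the matrices and drop rows that contain padding
all_rows <- do.call(rbind, shifted)
complete <- all_rows[stats::complete.cases(all_rows), , drop = FALSE]
print(complete)
```

Offset 0 contributes the rows (1,2) and (3,4), offset 1 contributes (2,3) and (4,5), so the complete rows are exactly the length(v) - k + 1 contiguous windows of v.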