Question

I have a set of numeric values, each of which represents a zone.

For example:

x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)

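I need to determine whether the data contains repeated subsequences, i.e. whether the subject repeatedly travels from zone 1 to 2 to 3. In the example above, 1,2,3 occurs 3 times. I do not know these subsequences in advance; I need R to find them from the given data.

After that, I need to count how many times each such subsequence occurs in the data.

If this is a simple task, please forgive my ignorance; my knowledge of R is very basic!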

Answer 1

Here is a way to find repeated sequences of length n and how many times each one is repeated.

For n = 3:

library(tidyverse) # not necessary, see base version below

n <- 3
lapply(seq(0, length(x) - n), `+`, seq(n)) %>% # get index of all subsequences
  map_chr(~ paste(x[.], collapse = ',')) %>% # paste together as character
  table %>% # get number of times each occurs
  `[`(. > 1) # select sequences occurring > 1 time
# 1,2,3 
# 3

For n = 2:

n <- 2
lapply(seq(0, length(x) - n), `+`, seq(n)) %>% 
  map_chr(~ paste(x[.], collapse = ',')) %>% 
  table %>% 
  `[`(. > 1)
# 1,2 2,3 5,9 
# 3   3   2

Without the tidyverse:

seqs <- lapply(seq(0, length(x) - n), `+`, seq(n))
seqs.char <- sapply(seqs, function(i) paste(x[i], collapse = ','))
tbl <- table(seqs.char)
tbl[tbl > 1]
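The base version can also be wrapped in a small function so that several window lengths are tried in one go. This is only a sketch: `repeated_seqs` is a made-up helper name, and it reuses the example vector `x` from the question.

```r
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)

# Hypothetical helper: table of the length-n windows that occur more than once
repeated_seqs <- function(x, n) {
  if (n > length(x)) return(table(character(0)))   # no window fits
  idx  <- lapply(seq(0, length(x) - n), `+`, seq_len(n))   # indices of every window
  keys <- vapply(idx, function(i) paste(x[i], collapse = ","), character(1))
  tbl  <- table(keys)
  tbl[tbl > 1]
}

# Try several window lengths at once
res <- lapply(2:4, function(n) repeated_seqs(x, n))
names(res) <- paste0("n=", 2:4)
res
```

For this `x`, the `n=4` entry comes back empty, so the longest repeated subsequence has length 3.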

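I'll add a question of my own: does anyone know how to do this without first converting to character? For example, a function fun such that fun(list(1:2, 1:2, 2:3)) tells you that 1:2 occurs twice and 2:3 occurs once?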
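One possible way to avoid the character conversion, sketched under the assumption that all subsequences have the same length and contain no NA values: stack them as rows of a matrix, sort the rows, and count runs of identical neighbouring rows. `count_rows` is a hypothetical helper name, not an existing function.

```r
# Hypothetical helper: count identical rows of a matrix without pasting to strings
# (assumes the matrix has at least one row and no NA values)
count_rows <- function(m) {
  # sort rows lexicographically, column by column
  ord <- do.call(order, lapply(seq_len(ncol(m)), function(j) m[, j]))
  ms  <- m[ord, , drop = FALSE]
  # TRUE where a sorted row differs from the row before it (start of a new group)
  differs <- rowSums(ms[-1, , drop = FALSE] != ms[-nrow(ms), , drop = FALSE]) > 0
  starts  <- c(TRUE, differs)
  data.frame(ms[starts, , drop = FALSE], count = tabulate(cumsum(starts)))
}

# The list from the question above, stacked as rows
r1 <- count_rows(do.call(rbind, list(1:2, 1:2, 2:3)))
r1

# Applied to the sliding windows of the example vector
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)
n <- 3
windows <- embed(x, n)[, n:1, drop = FALSE]   # row i is x[i:(i + n - 1)]
res <- count_rows(windows)
repeated <- res[res$count > 1, ]
repeated
```

Here `embed()` from the stats package builds the window matrix; its columns come out in reverse order, hence the `n:1` column index.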

Answer 2

Another tidyverse approach, which builds one large data frame of results covering every subsequence length you want to investigate:

library(tidyverse)

# example vector
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)

# function that takes the number of consecutive elements in a subsequence
# and returns a data frame ordered by count of occurrence
f = function(n) {

  data.frame(value = x) %>%               # get the vector x
    slice(1:(nrow(.)-n+1)) %>%            # remove values not needed from the end
    mutate(position = row_number()) %>%   # add position of each value
    rowwise() %>%                         # for each value/row
    mutate(vec = paste0(x[position:(position+n-1)], collapse = ",")) %>% # create subsequences as a string
    ungroup() %>%                         # forget the grouping
    count(vec, sort = TRUE) }             # order by counts descending


2:5 %>%                    # specify how many values in your subsequences you want to investigate (let's say from 2 to 5)
  map_df(~ data.frame(NumElements = ., f(.))) %>%  # apply your function and keep the number values
  arrange(desc(n)) %>%     # order by counts descending
  as_tibble()              # (only for visualisation purposes)


# # A tibble: 71 x 3
#   NumElements vec       n
#         <dbl> <chr> <int>
# 1           2 1,2       3
# 2           2 2,3       3
# 3           3 1,2,3     3
# 4           2 5,9       2
# 5           2 1,6       1
# 6           2 10,1      1
# 7           2 10,7      1
# 8           2 3,10      1
# 9           2 3,4       1
# 10          2 4,1       1
# # ... with 61 more rows

Answer 3

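The following finds sequences of any length (k): convert the input vector to a matrix with k columns, and do this k times, each time prepending 0:(k-1) NA's to the vector. Finally, count all rows across these k matrices (paste is used to join the elements of each row):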

frs <- function(x, k=2){
   padit <- function(.) c(.,rep(NA, k-length(.)%%k))
   xx <- lapply(1:k, function(iii) padit(c(rep(NA,iii-1), x)))
   xx <- do.call(rbind, lapply(xx, function(.) matrix(., ncol=k, byrow=TRUE)))
   xx <- sapply(split(xx, 1:NROW(xx)), paste, collapse=",")
   (function(x) x[x>1])(table(xx))

}

Output:

> frs(x,2)
xx
1,2 2,3 5,9 
  3   3   2 
> frs(x,3)
1,2,3 
    3 
> frs(x,4)
named integer(0)
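To see why the k shifted matrices cover every window, here is a small illustration of the padding step on its own. It is a sketch on made-up data (`x_small`), using the same padding rule as `padit()` inside `frs()`.

```r
x_small <- c(1, 2, 3, 4, 5)   # made-up illustration data
k <- 2

# Pad with NA at the end (same rule as padit() in frs(): when the length is
# already a multiple of k, a full extra row of k NAs is added)
pad <- function(v) c(v, rep(NA, k - length(v) %% k))

# One matrix per offset 0, ..., k-1: prepend that many NAs, then fill by row
mats <- lapply(seq_len(k), function(i) {
  matrix(pad(c(rep(NA, i - 1), x_small)), ncol = k, byrow = TRUE)
})
mats

# Rows without NA are exactly the length-k sliding windows of x_small
all_rows <- do.call(rbind, mats)
windows  <- all_rows[complete.cases(all_rows), , drop = FALSE]
windows
```

Rows that contain NA are partial windows at the ends of the vector. In `frs()` they are pasted and counted as well, but they do not repeat for the example `x`, so they are dropped by the `x > 1` filter.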

How to identify repeated subsequences in a dataset