Question

我有一些看起来像这样的数据：

29  32  33  46  47  48
29  34  35  39  40  43
29  35  36  38  41  43
30  31  32  34  36  49
30  32  35  40  43  44
39  40  43  46  47  50
 7  8    9  39  40  43
 1  7    8  12  40  43

实际上有更多的数据，但我想保持这个简短。我想在R中找到一种方法来找到所有行的最长公共子序列，并按频率排序（递减），其中只报告序列中具有多个元素和多个频率的那些公共子序列。有没有办法在R？中做到这一点？

所以示例结果如下：

[29] 3
[30] 2 
...
( etc for all the single duplicates across each row and their frequencies )
...
[46  47] 2
[39  40  43] 3
[40, 43] 2

Answer 1

好像你在问两种不同的问题。您希望 1）列的单个值的连续运行长度和 2）计数（非连续）ngrams（按行进行）但按列计数。

library(tidyverse)
# single number contiguous runs by column
single <- Reduce("rbind", apply(df, 2, function(x) tibble(val=rle(x)$values, occurrence=rle(x)$lengths) %>% filter(occurrence>1)))

单个

的输出

    val occurrence
  <int>      <int>
1    29          3
2    30          2
3    40          2
4    43          2
5    43          2

# ngram numbers by row (count, non-contiguous)
restof <- Reduce("rbind", lapply(1:(ncol(df)-1), function(z) {
    nruns <- t(apply(df, 1, function(x) sapply(head(seq_along(x),-z), function(y) paste(x[y:(y+z)], collapse=" "))) )
    Reduce("rbind", apply(nruns, 2, function(x) tibble(val=names(table(x)), occurrence=c(table(x))) %>% filter(occurrence>1)))
}))

输出ngrams

       val occurrence
     <chr>      <int>
1    39 40          2
2    46 47          2
3    40 43          3
4 39 40 43          2

合并数据

ans <- rbind(single, restof)

输出

       val occurrence
     <chr>      <int>
1       29          3
2       30          2
3       40          2
4       43          2
5       43          2
6    39 40          2
7    46 47          2
8    40 43          3
9 39 40 43          2

您的数据

df <- read.table(text="29  32  33  46  47  48
29  34  35  39  40  43
29  35  36  38  41  43
30  31  32  34  36  49
30  32  35  40  43  44
39  40  43  46  47  50
 7  8    9  39  40  43
 1  7    8  12  40  43")

R - 如何找到最长的重复序列及其频率

1 个答案: