R - 如何找到最长的重复序列及其频率

时间:2017-09-14 17:15:28

标签: r dataframe subsequence

我有一些看起来像这样的数据:

29  32  33  46  47  48
29  34  35  39  40  43
29  35  36  38  41  43
30  31  32  34  36  49
30  32  35  40  43  44
39  40  43  46  47  50
 7  8    9  39  40  43
 1  7    8  12  40  43

实际上有更多的数据,但我想保持这个简短。我想在R中找到一种方法来找到所有行的最长公共子序列,并按频率排序(递减),其中只报告序列中具有多个元素和多个频率的那些公共子序列。有没有办法在R?中做到这一点?

所以示例结果如下:

[29] 3
[30] 2 
...
( etc for all the single duplicates across each row and their frequencies )
...
[46  47] 2
[39  40  43] 3
[40, 43] 2

1 个答案:

答案 0 :(得分:0)

好像你在问两种不同的问题。您希望 1)列的单个值的连续运行长度和 2)计数(非连续)ngrams(按行进行)但按列计数。

library(tidyverse)
# single number contiguous runs by column
single <- Reduce("rbind", apply(df, 2, function(x) tibble(val=rle(x)$values, occurrence=rle(x)$lengths) %>% filter(occurrence>1)))

单个

的输出
    val occurrence
  <int>      <int>
1    29          3
2    30          2
3    40          2
4    43          2
5    43          2
# ngram numbers by row (count, non-contiguous)
restof <- Reduce("rbind", lapply(1:(ncol(df)-1), function(z) {
    nruns <- t(apply(df, 1, function(x) sapply(head(seq_along(x),-z), function(y) paste(x[y:(y+z)], collapse=" "))) )
    Reduce("rbind", apply(nruns, 2, function(x) tibble(val=names(table(x)), occurrence=c(table(x))) %>% filter(occurrence>1)))
}))

输出ngrams

       val occurrence
     <chr>      <int>
1    39 40          2
2    46 47          2
3    40 43          3
4 39 40 43          2

合并数据

ans <- rbind(single, restof)

输出

       val occurrence
     <chr>      <int>
1       29          3
2       30          2
3       40          2
4       43          2
5       43          2
6    39 40          2
7    46 47          2
8    40 43          3
9 39 40 43          2

您的数据

df <- read.table(text="29  32  33  46  47  48
29  34  35  39  40  43
29  35  36  38  41  43
30  31  32  34  36  49
30  32  35  40  43  44
39  40  43  46  47  50
 7  8    9  39  40  43
 1  7    8  12  40  43")