I have some data that looks like this:
29 32 33 46 47 48
29 34 35 39 40 43
29 35 36 38 41 43
30 31 32 34 36 49
30 32 35 40 43 44
39 40 43 46 47 50
7 8 9 39 40 43
1 7 8 12 40 43
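There is actually a lot more data, but I wanted to keep this short. I would like to find a way in R to find the longest common subsequences across all rows and sort them by frequency (decreasing), reporting only those common subsequences that have more than one element and occur more than once. Is there a way to do this in R?
So an example result would look like this: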
[29] 3
[30] 2
...
( etc for all the single duplicates across each row and their frequencies )
...
[46 47] 2
[39 40 43] 3
[40, 43] 2
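Answer 0 (score: 0)
It seems like you are asking two different kinds of questions. You want 1) the lengths of contiguous runs of a single value, taken down each column, and 2) counts of ngrams that are formed within each row but tallied by column, where the matching rows do not have to be adjacent.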
library(tidyverse)
# single number contiguous runs by column
single <- Reduce("rbind", apply(df, 2, function(x) tibble(val=rle(x)$values, occurrence=rle(x)$lengths) %>% filter(occurrence>1)))
Output of single
val occurrence
<int> <int>
1 29 3
2 30 2
3 40 2
4 43 2
5 43 2
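To make the first step easier to follow, here is a minimal sketch of what `rle()` does on one column. It uses base R only, and the vector `col1` is the first column of the sample data re-typed by hand, so it does not depend on `df`.

```r
# rle() collapses consecutive equal values into runs.
col1 <- c(29, 29, 29, 30, 30, 39, 7, 1)  # first column of the sample data

runs <- rle(col1)
runs$values   # the value of each run, in order of appearance
runs$lengths  # how many consecutive times each value appeared

# Keep only the values that repeat consecutively (run length > 1).
repeated <- data.frame(val = runs$values, occurrence = runs$lengths)
repeated <- repeated[repeated$occurrence > 1, ]
print(repeated)
```

This is why 43 can show up twice in the output above: two separate runs of the same value in one column are reported as two rows.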
# ngram numbers by row (count, non-contiguous)
restof <- Reduce("rbind", lapply(1:(ncol(df)-1), function(z) {
nruns <- t(apply(df, 1, function(x) sapply(head(seq_along(x),-z), function(y) paste(x[y:(y+z)], collapse=" "))) )
Reduce("rbind", apply(nruns, 2, function(x) tibble(val=names(table(x)), occurrence=c(table(x))) %>% filter(occurrence>1)))
}))
Output of restof (the ngrams)
val occurrence
<chr> <int>
1 39 40 2
2 46 47 2
3 40 43 3
4 39 40 43 2
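Because the ngrams above are tallied by column, a sequence only counts as repeated when it sits in the same columns in each row. The question's example result lists [39 40 43] with a frequency of 3, which suggests position may not matter. The following is a rough sketch under that assumption, not part of the original answer: `row_ngrams` is a hypothetical helper, the data are re-read into a matrix `m`, and each ngram is counted at most once per row. It does not discount shorter ngrams that are contained in a longer match, since the question does not say how those should be handled.

```r
# Sample data from the question as an integer matrix (8 rows, 6 columns).
m <- as.matrix(read.table(text = "29 32 33 46 47 48
29 34 35 39 40 43
29 35 36 38 41 43
30 31 32 34 36 49
30 32 35 40 43 44
39 40 43 46 47 50
7 8 9 39 40 43
1 7 8 12 40 43"))

# Hypothetical helper: all contiguous ngrams of length n in one row, as strings.
row_ngrams <- function(x, n) {
  starts <- seq_len(length(x) - n + 1)
  vapply(starts, function(i) paste(x[i:(i + n - 1)], collapse = " "), character(1))
}

# For n = 2 .. ncol(m), collect each row's ngrams (once per row) and tabulate them.
all_ngrams <- unlist(lapply(2:ncol(m), function(n) {
  unlist(lapply(seq_len(nrow(m)), function(r) unique(row_ngrams(m[r, ], n))))
}))
counts <- table(all_ngrams)

# Keep ngrams seen in more than one row, most frequent first.
repeated <- sort(counts[counts > 1], decreasing = TRUE)
print(repeated)
```

With this counting rule "39 40 43" is found in three rows, because the row that starts with 39 40 43 is now included.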
Combine the data
ans <- rbind(single, restof)
Output
val occurrence
<chr> <int>
1 29 3
2 30 2
3 40 2
4 43 2
5 43 2
6 39 40 2
7 46 47 2
8 40 43 3
9 39 40 43 2
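The question also asks for the result sorted by decreasing frequency, which the combined table is not. A small sketch of that last step in base R follows; the data frame here is the output shown above re-typed by hand so the snippet runs on its own, and `ans_sorted` is a name introduced only for this example.

```r
# The combined result from above, re-typed so this snippet is self-contained.
ans <- data.frame(
  val = c("29", "30", "40", "43", "43", "39 40", "46 47", "40 43", "39 40 43"),
  occurrence = c(3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L),
  stringsAsFactors = FALSE
)

# Reorder the rows so the most frequent entries come first.
ans_sorted <- ans[order(ans$occurrence, decreasing = TRUE), ]
print(ans_sorted)
```

With the tibble from the answer, `dplyr::arrange(ans, desc(occurrence))` should do the same thing, since tidyverse is already loaded.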
Your data (run this first, since the code above expects df to exist)
df <- read.table(text="29 32 33 46 47 48
29 34 35 39 40 43
29 35 36 38 41 43
30 31 32 34 36 49
30 32 35 40 43 44
39 40 43 46 47 50
7 8 9 39 40 43
1 7 8 12 40 43")