I have this data.frame:
10 34 35 39 55 43
10 32 33 40 45 48
10 35 36 38 41 43
30 31 32 34 36 49
39 55 40 43 45 50
30 32 35 36 49 50
2 8 9 39 55 43
1 2 8 12 55 43
2 8 12 55 43 61
2 8 55 43 61 78
I want to find all subsequences (length > 2) across all rows and group them by frequency (frequency > 1). In this case, the output should show:
sequence frequency
[39 55 43] 3
[10 35 43] 2
[32 36 49] 2
[30 32 36] 2
[30 32 36 49] 2
[ 2 8 55] 4
[ 2 8 55 43] 4
[ 2 8 55 43 61] 2
Is it possible to do this in R?
Answer 0: (score: 7)
You can write a function subseqs that enumerates all subsequences of each row, and then count them with table:
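# Enumerate every subsequence of length >= 3 in a row, each collapsed to one string
subseqs <- function(v) sapply(3:length(v), function(k) combn(v, k, FUN = toString))
# Count how often each subsequence occurs across all rows
f <- table(unlist(apply(df, 1, subseqs)), dnn = "sequence")
# Keep only the subsequences that occur at least twice
dfout <- data.frame(f[f >= 2])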
This gives:
> dfout
sequence Freq
1 10, 35, 43 2
2 12, 55, 43 2
3 2, 12, 43 2
4 2, 12, 55 2
5 2, 12, 55, 43 2
6 2, 43, 61 2
7 2, 55, 43 4
8 2, 55, 43, 61 2
9 2, 55, 61 2
10 2, 8, 12 2
11 2, 8, 12, 43 2
12 2, 8, 12, 55 2
13 2, 8, 12, 55, 43 2
14 2, 8, 43 4
15 2, 8, 43, 61 2
16 2, 8, 55 4
17 2, 8, 55, 43 4
18 2, 8, 55, 43, 61 2
19 2, 8, 55, 61 2
20 2, 8, 61 2
21 30, 32, 36 2
22 30, 32, 36, 49 2
23 30, 32, 49 2
24 30, 36, 49 2
25 32, 36, 49 2
26 39, 55, 43 3
27 55, 43, 61 2
28 8, 12, 43 2
29 8, 12, 55 2
30 8, 12, 55, 43 2
31 8, 43, 61 2
32 8, 55, 43 4
33 8, 55, 43, 61 2
34 8, 55, 61 2
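To see why the result contains sequences whose elements are not adjacent in a row (such as 10, 35, 43), it may help to look at what combn returns for a single short vector. The sketch below uses a made-up four-element vector v, which is an illustration only and not one of the original rows:

```r
# Hypothetical short row, used only for illustration
v <- c(2, 8, 55, 43)

# All ways of picking 3 elements while keeping their order in v;
# toString collapses each pick into a single comma-separated string
three <- combn(v, 3, FUN = toString)
print(three)
# the four strings are "2, 8, 55", "2, 8, 43", "2, 55, 43", "8, 55, 43"
```

Because combn picks positions in increasing order, each string keeps the order the elements had in the row, but gaps are allowed. Two rows therefore share a string whenever they contain the same values in the same relative order.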
Data
df <- structure(list(V1 = c(10L, 10L, 10L, 30L, 39L, 30L, 2L, 1L, 2L,
2L), V2 = c(34L, 32L, 35L, 31L, 55L, 32L, 8L, 2L, 8L, 8L), V3 = c(35L,
33L, 36L, 32L, 40L, 35L, 9L, 8L, 12L, 55L), V4 = c(39L, 40L,
38L, 34L, 43L, 36L, 39L, 12L, 55L, 43L), V5 = c(55L, 45L, 41L,
36L, 45L, 49L, 55L, 55L, 43L, 61L), V6 = c(43L, 48L, 43L, 49L,
50L, 50L, 43L, 43L, 61L, 78L)), class = "data.frame", row.names = c(NA,
-10L))
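Since the question asks for the result grouped by frequency, a possible follow-up is to sort the counts. The sketch below is a self-contained variant under a few assumptions: the data is re-read from text with read.table, subseqs_safe is a hypothetical helper (not part of the answer) that returns a flat character vector and skips rows with fewer than min_len elements, and ties are broken by sequence length.

```r
# The question's rows, read from text; V1..V6 are read.table's default column names
df <- read.table(text = "
10 34 35 39 55 43
10 32 33 40 45 48
10 35 36 38 41 43
30 31 32 34 36 49
39 55 40 43 45 50
30 32 35 36 49 50
2 8 9 39 55 43
1 2 8 12 55 43
2 8 12 55 43 61
2 8 55 43 61 78
")

# Hypothetical variant of subseqs: returns a flat character vector and
# guards against short rows, where min_len:length(v) would count downwards
subseqs_safe <- function(v, min_len = 3) {
  if (length(v) < min_len) return(character(0))
  unlist(lapply(min_len:length(v), function(k) combn(v, k, FUN = toString)))
}

# Work row by row on a plain matrix so each row is an unnamed vector
m <- unname(as.matrix(df))
all_seqs <- unlist(lapply(seq_len(nrow(m)), function(i) subseqs_safe(m[i, ])))

# Count occurrences and keep the subsequences seen at least twice
counts <- table(all_seqs)
keep <- as.vector(counts >= 2)
dfout <- data.frame(sequence = names(counts)[keep],
                    Freq = as.vector(counts)[keep],
                    stringsAsFactors = FALSE)

# Most frequent first; ties broken by the number of elements, longest first
n_elem <- lengths(strsplit(dfout$sequence, ", ", fixed = TRUE))
dfout <- dfout[order(-dfout$Freq, -n_elem), ]
rownames(dfout) <- NULL
print(head(dfout))
```

On the data above this should keep the same 34 subsequences as the answer, with 2, 8, 55, 43 (frequency 4) in the first row. Note that the number of subsequences per row grows exponentially with the row length, so this approach is only practical for short rows such as the six-element ones here.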