Counting the occurrences of part of a string

Time: 2011-05-24 12:21:27

Tags: r

I have a text file like this:

V1 V2   V3
X  N    aaaaaabbbabab
C  T    ababaaabaaabb
V  H    babbbabaabbba

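What I want to do is count how many a's and how many b's occur at each character position of the V3 column, across all rows.

So the output would look like this: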

   col1  col2 col3 .......  col13
a  2     2    2             1
b  1     1    1             2

How can I do this?

I tried the count function and substrings, but that didn't work.

Thanks

3 answers:

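Answer 0: (score: 4)

Assuming dat contains your data, we split each string in V3 into single characters with strsplit() and arrange them in a matrix, one row per string: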

tt <- matrix(unlist(strsplit(as.character(dat$V3), split = "")), ncol = 13, byrow = TRUE)

which gives:

> tt
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b"  "b"  "a"   "b"   "a"   "b"  
[2,] "a"  "b"  "a"  "b"  "a"  "a"  "a"  "b"  "a"  "a"   "a"   "b"   "b"  
[3,] "b"  "a"  "b"  "b"  "b"  "a"  "b"  "a"  "a"  "b"   "b"   "b"   "a"

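We can then get the desired result by setting the factor levels explicitly, so that a character that does not occur in a column is still counted as 0: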

apply(tt, 2, function(x) c(table(factor(x, levels = c("a","b")))))

which gives:

> apply(tt, 2, function(x) c(table(factor(x, levels = c("a","b")))))
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a    2    2    2    1    2    3    1    1    2     2     1     1     1
b    1    1    1    2    1    0    2    2    1     1     2     2     2
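
To see why the levels matter, here is a small sketch using column 6 of the example, which contains only "a". The variable names (col6, plain, fixed) are illustrative and not part of the answer. Without explicit levels, table() reports only the values that actually occur, so the columns would return vectors of different lengths and apply() could not assemble them into a matrix.

```r
# Column 6 of the example data contains only "a".
col6 <- c("a", "a", "a")

# Without explicit levels, table() only reports values that occur.
plain <- table(col6)

# With both levels declared, the absent "b" is reported with a count of 0.
fixed <- table(factor(col6, levels = c("a", "b")))

print(plain)
print(fixed)
```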

To have the appropriate levels chosen automatically, we can do the following:

> lev <- levels(factor(tt))
> apply(tt, 2, function(x, levels) c(table(factor(x, levels = levels))), 
+       levels = lev)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a    2    2    2    1    2    3    1    1    2     2     1     1     1
b    1    1    1    2    1    0    2    2    1     1     2     2     2

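In the first line, we treat tt as a vector and extract the levels after temporarily converting tt to a factor. We then pass these levels (lev) to the apply() step instead of spelling them out.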
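
Putting the steps of this answer together, a minimal end-to-end sketch might look like the following. It assumes the data are read with stringsAsFactors = FALSE and that all strings in V3 have the same length. Taking the width from nchar() rather than hard-coding 13 is a small variation, not part of the answer; the names n and counts are illustrative.

```r
# Example data from the question; stringsAsFactors = FALSE keeps V3 as character.
dat <- read.table(text = c("V1 V2 V3",
                           "X  N  aaaaaabbbabab",
                           "C  T  ababaaabaaabb",
                           "V  H  babbbabaabbba"),
                  header = TRUE, stringsAsFactors = FALSE)

# Assumption: every string has the same number of characters.
n <- unique(nchar(dat$V3))
stopifnot(length(n) == 1)

# One row per string, one column per character position.
tt <- matrix(unlist(strsplit(dat$V3, split = "")), ncol = n, byrow = TRUE)

# Derive the levels from the data, then count each level per column.
lev <- levels(factor(tt))
counts <- apply(tt, 2, function(x) c(table(factor(x, levels = lev))))
print(counts)
```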

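Answer 1: (score: 2)

Edit: corrected the solution following Gavin Simpson's comment. It works now.

To avoid repeated conversions to factor, you can use the following trick with indexing and tapply: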

tt <- c("aaaaaabbbabab","ababaaabaaabb","babbbabaabbba")

ttstr <- strsplit(tt,"")
ttf <- factor(unlist(ttstr))
n <- length(ttstr[[1]])
k <- length(ttstr)

> do.call(cbind,tapply(ttf,rep(1:n,k),table))
  1 2 3 4 5 6 7 8 9 10 11 12 13
a 2 2 2 1 2 3 1 1 2  2  1  1  1
b 1 1 1 2 1 0 2 2 1  1  2  2  2

This is about 7 times faster than the method shown by @Gavin:

> benchmark(method1(tt),method2(tt),replications=1)
         test replications elapsed relative user.self 
1 method1(tt)            1    0.89 1.000000      0.89   
2 method2(tt)            1    6.99 7.853933      6.98     
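
The index vector rep(1:n, k) in this answer pairs every character with its position inside its string. As a sketch of the same idea, the characters can also be cross-tabulated against that index with a single table() call. This variant is not from the answer and makes no claim about speed; the names pos and counts are illustrative.

```r
tt <- c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba")
ttstr <- strsplit(tt, "")
n <- length(ttstr[[1]])  # characters per string
k <- length(ttstr)       # number of strings

# unlist() concatenates the strings one after another, so the position of
# each character within its string is 1..n repeated k times.
pos <- rep(seq_len(n), times = k)

# Cross-tabulate character against position; combinations that never occur
# (such as "b" at position 6) appear as 0.
counts <- table(char = unlist(ttstr), position = pos)
print(counts)
```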

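Answer 2: (score: 0)

Here is a new version that solves the actual problem. It still uses gregexpr, but this time with indexing. To account for the zero-count cells (which I could not get out of table), I had to go a bit out of my way.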

foo <- data.frame(
    V1 = c("X","C","V"),
    V2 = c("N","T","H"),
    V3 = c("aaaaaabbbabab","ababaaabaaabb","babbbabaabbba"))

n <- nchar(as.character(foo$V3)[1])
tabA <- table(unlist(gregexpr("a",foo$V3)),exclude=-1)
tabB <- table(unlist(gregexpr("b",foo$V3)),exclude=-1)

res <- matrix(0,2,n)

res[1,as.numeric(names(tabA))] <- tabA
res[2,as.numeric(names(tabB))] <- tabB

rownames(res) <- c("a","b")
res
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a    2    2    2    1    2    3    1    1    2     2     1     1     1
b    1    1    1    2    1    0    2    2    1     1     2     2     2

If there are no zero-count cells, you can simply do rbind(tabA, tabB).
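
As a possible simplification of the zero-count handling, the match positions from gregexpr can be passed to tabulate() instead of table(). This is a sketch of an alternative, not the answer's own method. tabulate() returns one bin for each of 1..nbins and ignores values outside that range, so the -1 that gregexpr returns for "no match" is dropped and empty positions come out as 0. The helper name count_at_positions is hypothetical.

```r
V3 <- c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba")
n <- nchar(V3[1])  # assumes all strings have the same length

# Hypothetical helper: for each position 1..n, count how many strings
# have the character `ch` at that position.
count_at_positions <- function(ch, strings, n) {
  hits <- unlist(gregexpr(ch, strings, fixed = TRUE))
  tabulate(hits, nbins = n)  # -1 (no match) lies outside 1..n and is ignored
}

res <- rbind(a = count_at_positions("a", V3, n),
             b = count_at_positions("b", V3, n))
print(res)
```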