I have a text file like this:
V1 V2 V3
X N aaaaaabbbabab
C T ababaaabaaabb
V H babbbabaabbba
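What I want to do is count how many a's and how many b's occur at each character position of the V3 column.
So the output would look like this: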
col1 col2 col3 ....... col13
a 2 2 2 1
b 1 1 1 2
How can I do this?
I tried the count function and substring, but it did not work.
Thanks
Answer 0 (score: 4)
Assuming dat contains your data, we process it with strsplit():
tt <- matrix(unlist(strsplit(as.character(dat$V3), split = "")), ncol = 13, byrow = TRUE)
which gives:
> tt
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "a" "a" "a" "a" "a" "a" "b" "b" "b" "a" "b" "a" "b"
[2,] "a" "b" "a" "b" "a" "a" "a" "b" "a" "a" "a" "b" "b"
[3,] "b" "a" "b" "b" "b" "a" "b" "a" "a" "b" "b" "b" "a"
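We can get the desired result by setting the factor levels explicitly, so that a letter that never occurs in a column is still counted as 0: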
apply(tt, 2, function(x) c(table(factor(x, levels = c("a","b")))))
which gives:
> apply(tt, 2, function(x) c(table(factor(x, levels = c("a","b")))))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a 2 2 2 1 2 3 1 1 2 2 1 1 1
b 1 1 1 2 1 0 2 2 1 1 2 2 2
To select the appropriate levels automatically, we can do the following:
> lev <- levels(factor(tt))
> apply(tt, 2, function(x, levels) c(table(factor(x, levels = levels))),
+ levels = lev)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a 2 2 2 1 2 3 1 1 2 2 1 1 1
b 1 1 1 2 1 0 2 2 1 1 2 2 2
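In the first line, we treat tt as a vector and extract the levels after temporarily converting tt to a factor. We then supply those levels (lev) to the apply() step instead of stating the levels explicitly.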
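The steps above hard-code ncol = 13. As a minimal sketch, assuming every string in V3 has the same length, the matrix can be built with do.call(rbind, ...) so the width is taken from the data. The data frame dat below is reconstructed from the question, and the names chars, counts and the col1 ... col13 labels are only illustrative:

```r
# Rebuild the example data from the question (assumed to match the file).
dat <- data.frame(
  V1 = c("X", "C", "V"),
  V2 = c("N", "T", "H"),
  V3 = c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba"),
  stringsAsFactors = FALSE
)

# Split each string into single characters; as.character() guards
# against V3 having been read in as a factor.
chars <- strsplit(as.character(dat$V3), split = "")

# This approach only makes sense if all strings have equal length.
stopifnot(length(unique(lengths(chars))) == 1)

# One row per string, one column per character position.
tt <- do.call(rbind, chars)

# Use the full set of observed letters as levels so zero counts are kept.
lev <- sort(unique(as.vector(tt)))
counts <- apply(tt, 2, function(x) c(table(factor(x, levels = lev))))
colnames(counts) <- paste0("col", seq_len(ncol(counts)))
counts
```

The result is a matrix with one row per letter and one column per position, in the layout asked for in the question.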
Answer 1 (score: 2)
To avoid a lot of conversions to factor, you can use the following trick with indexing and tapply:
tt <- c("aaaaaabbbabab","ababaaabaaabb","babbbabaabbba")
ttstr <- strsplit(tt,"")
ttf <- factor(unlist(ttstr))
n <- length(ttstr[[1]])
k <- length(ttstr)
> do.call(cbind,tapply(ttf,rep(1:n,k),table))
1 2 3 4 5 6 7 8 9 10 11 12 13
a 2 2 2 1 2 3 1 1 2 2 1 1 1
b 1 1 1 2 1 0 2 2 1 1 2 2 2
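This gives a speedup of roughly a factor of 7 compared to the method shown by @Gavin (a relative elapsed time of about 7.85 in the run below):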
> benchmark(method1(tt),method2(tt),replications=1)
test replications elapsed relative user.self
1 method1(tt) 1 0.89 1.000000 0.89
2 method2(tt) 1 6.99 7.853933 6.98
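To make the indexing trick more concrete: unlist() concatenates the k strings one after another, so rep(1:n, k) labels every character with its position inside its own string, and tapply() then tabulates the letters within each position. The sketch below checks that reading against a plain two-way table(), which is a different technique used here only for comparison. The names pos, by_tapply and by_table are illustrative, and no timing claim is made for either variant:

```r
tt <- c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba")
ttstr <- strsplit(tt, "")
n <- length(ttstr[[1]])   # characters per string
k <- length(ttstr)        # number of strings

# A single factor over all characters, so both levels survive in every group.
ttf <- factor(unlist(ttstr))

# Position label for each character: 1..n repeated once per string.
pos <- rep(seq_len(n), times = k)

# Per-position tables, bound together column by column.
by_tapply <- do.call(cbind, tapply(ttf, pos, table))

# Cross-tabulating letter against position yields the same counts.
by_table <- table(ttf, pos)

stopifnot(all(as.vector(by_tapply) == as.vector(by_table)))
by_tapply
```

Because ttf is converted to a factor once, each per-position table keeps both levels, which is why column 6 reports a 0 for b rather than dropping the row.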
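Answer 2 (score: 0)
Here is a new version that addresses the actual question. It still uses gregexpr, but this time with indexing. To account for the zero-count cells (which I could not get out of table), I had to go a bit out of my way: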
foo <- data.frame(
V1 = c("X","C","V"),
V2 = c("N","T","H"),
V3 = c("aaaaaabbbabab","ababaaabaaabb","babbbabaabbba"))
n <- nchar(as.character(foo$V3)[1])
tabA <- table(unlist(gregexpr("a",foo$V3)),exclude=-1)
tabB <- table(unlist(gregexpr("b",foo$V3)),exclude=-1)
res <- matrix(0,2,n)
res[1,as.numeric(names(tabA))] <- tabA
res[2,as.numeric(names(tabB))] <- tabB
rownames(res) <- c("a","b")
res
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a 2 2 2 1 2 3 1 1 2 2 1 1 1
b 1 1 1 2 1 0 2 2 1 1 2 2 2
If there are no zero-count cells, you can simply do rbind(tabA,tabB).
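As a possible shortcut for the zero-count problem, tabulate() with a fixed nbins returns one count per position, including zeros, so the manual matrix fill is not needed. This is only a sketch under the same assumption that all strings have equal length. The helper count_positions is a hypothetical name introduced here, and the "no match" value -1 returned by gregexpr is filtered out explicitly:

```r
v3 <- c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba")
n <- nchar(v3[1])

# Hypothetical helper: count at which positions the single character
# `ch` occurs across all strings in `x`.
count_positions <- function(ch, x, n) {
  pos <- unlist(gregexpr(ch, x, fixed = TRUE))
  pos <- pos[pos > 0]          # drop the -1 "no match" marker
  tabulate(pos, nbins = n)     # one bin per position, zeros included
}

res <- rbind(
  a = count_positions("a", v3, n),
  b = count_positions("b", v3, n)
)
res
```

Positions where a letter never occurs, such as b in column 6, come out as 0 directly.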