Question

The problem is that I have a large text file. Let it be

 a=c("atcgatcgatcgatcgatcgatcgatcgatcgatcg")

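I need to compare every third symbol in this text with a value (e.g. 'c') and, if it matches, add 1 to a counter i. I thought about using grep, but it seems this function doesn't suit my purpose. So I need your help or advice.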

What's more, I want to extract certain values from this string into a vector. For example, I want to extract symbols 4:10, e.g.

 a=c("atcgatcgatcgatcgatcgatcgatcgatcgatcg")
[1] "gatcgatcga"

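Thank you in advance.

P.S.

I know that R is not the best choice for writing the script I need, but I'm curious whether it is possible to write it in a proper way.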

Answer 1

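Edited to provide a fast solution for much larger strings:

If you have a very long string (on the order of millions of nucleotides), the lookbehind assertion in my original answer (below) is too slow to be practical. In that case, use something more like the following, which: (1) splits the string apart between every character; (2) uses the characters to fill a three-row matrix; and then (3) extracts the characters in the third row of the matrix. This takes about 0.2 seconds to process a 3-million-character string.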

## Make a 3-million character long string
a <- paste0(sample(c("a", "t", "c", "g"), 3e6, replace=TRUE), collapse="")

## Extract the third nucleotide of each triplet (codon)
n3  <- matrix(strsplit(a, "")[[1]], nrow=3)[3,]

## Check that it works
sum(n3=="c")
# [1] 250431
table(n3)
#  n3
#      a      c      g      t 
# 250549 250431 249008 250012
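
The matrix approach above works cleanly when the string length is a multiple of three; otherwise matrix() issues a warning and recycles values to fill the last column. As a minimal sketch of an index-based alternative that simply ignores a trailing incomplete triplet (the helper name every_nth is hypothetical, not part of the original answer):

```r
## Hypothetical helper: return every n-th character of a single string.
every_nth <- function(x, n = 3) {
  chars <- strsplit(x, "")[[1]]
  ## Guard: seq() would fail if the string is shorter than n
  if (length(chars) < n) return(character(0))
  chars[seq(n, length(chars), by = n)]
}

a <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"
n3 <- every_nth(a, 3)
sum(n3 == "c")
# [1] 3

## A trailing incomplete triplet is ignored
tail_case <- every_nth("atcga", 3)
tail_case
# [1] "c"
```

Whether this is as fast as the matrix version on multi-million-character strings has not been measured here.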

Original answer:

I might use substr() in both cases.

## Split into codons. (The "lookbehind assertion", "(?<=.{3})", matches at each
## inter-character location that's preceded by three characters of any type.)
codons <- strsplit(a, "(?<=.{3})", perl=TRUE)[[1]]
#  [1] "atc" "gat" "cga" "tcg" "atc" "gat" "cga" "tcg" "atc" "gat" "cga" "tcg"

## Extract 3rd nucleotide in each codon
n3 <- sapply(codons, function(X) substr(X,3,3))
# atc gat cga tcg atc gat cga tcg atc gat cga tcg 
# "c" "t" "a" "g" "c" "t" "a" "g" "c" "t" "a" "g" 

## Count the number of 'c's
sum(n3=="c")
# [1] 3


## Extract nucleotides 4-10
substr(a, 4,10)
# [1] "gatcgat"
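
Since the question asks for the extracted values as a vector, the following sketch shows two possible readings of that, assuming the same example string: splitting the extracted range into single characters, and pulling several ranges at once with substring(), which is vectorized over its start and end positions. The variable names chars and parts are illustrative only.

```r
a <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"

## Characters 4 to 10 as a vector of single characters
chars <- strsplit(substr(a, 4, 10), "")[[1]]
chars
# [1] "g" "a" "t" "c" "g" "a" "t"

## Several ranges at once: substring() is vectorized over first and last
parts <- substring(a, c(1, 4, 11), c(3, 10, 12))
parts
# [1] "atc"     "gatcgat" "cg"
```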

Answer 2

Here is a simple approach using R primitives:

sum("c"==(strsplit(a,NULL))[[1]][c(FALSE,FALSE,TRUE)])
# [1] 3  (this is the right answer)

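The boolean pattern c(FALSE,FALSE,TRUE) is recycled to be as long as the vector of characters, and is then used to index it. It can be adjusted to match a different position or a longer period (for those with extended codons).

It is probably not performant enough for whole genomes, but it is perfect for casual use.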
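
To illustrate how the recycled logical pattern can be adjusted, here is a small sketch on the same example string. The variable names are illustrative; only the pattern passed to the index changes.

```r
a <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"
chars <- strsplit(a, NULL)[[1]]

## First position of every triplet: move TRUE to the front of the pattern
first_a <- sum(chars[c(TRUE, FALSE, FALSE)] == "a")
first_a
# [1] 3

## Every 4th character: use a pattern of length four
fourth_g <- sum(chars[c(FALSE, FALSE, FALSE, TRUE)] == "g")
fourth_g
# [1] 9
```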

Answer 3

Compare every third character with "c" (this returns TRUE only if all of them are "c"):

grepl("^(.{2}c)*.{0,2}$", a)
# [1] FALSE
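
Note that grepl() answers a yes/no question about the whole string and does not count matches. As a hedged sketch of how a regular expression could be used for counting instead (this is an alternative illustration, not the answerer's method), each complete triplet can be collapsed to its third character with gsub() before counting:

```r
a <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"

## The pattern is TRUE only when all third characters are "c"
all_c <- grepl("^(.{2}c)*.{0,2}$", "aacttc")
all_c
# [1] TRUE

## Drop any trailing incomplete triplet, then keep the third character of each triplet
full   <- substr(a, 1, 3 * (nchar(a) %/% 3))
thirds <- gsub("..(.)", "\\1", full)
thirds
# [1] "ctagctagctag"

## Count the "c"s among the third characters
n_c <- sum(strsplit(thirds, "")[[1]] == "c")
n_c
# [1] 3
```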

Extract characters 4 to 10:

substr(a, 4, 10)
# [1] "gatcgat"
