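I have a dataset with a column of strings made up of the four letters A, T, C and G; the strings are between 2 and 1991 characters long. I want to subset all rows that match a particular pattern. For example, I would like to create a new data frame containing only the rows where column 17 has runs of 0-10 consecutive Ts.
Let me know if you need any other information, and thanks for your time!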
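Answer 0 (score: 1)
You can filter out every row in which 11 consecutive Ts are found. The pattern T{11} matches rows with exactly 11 consecutive Ts as well as rows with longer runs, so negating the match keeps only the rows whose runs of Ts are at most 10 long.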
## Example vector
v = c("TTTTTTTTTTACAGATAT","TTTACACAC","TTTTTTTTTTTTTACAGAT","TTTTTTTTTTTACAG")
v[!grepl("T{11}",v)]
[1] "TTTTTTTTTTACAGATAT" "TTTACACAC"
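The same filter can be applied to rows of a data frame rather than to a plain vector. The following is a minimal sketch, assuming a small made-up data frame; the names `df`, `id`, `seq` and `short_runs` are illustrative and do not come from the question.

```r
# Hypothetical data frame; the column names `id` and `seq` are assumptions.
# strrep() is used so the number of consecutive Ts is explicit.
df <- data.frame(
  id  = 1:4,
  seq = c(
    paste0(strrep("T", 10), "ACAGATAT"),  # run of 10 Ts
    "TTTACACAC",                          # run of 3 Ts
    paste0(strrep("T", 13), "ACAGAT"),    # run of 13 Ts
    paste0(strrep("T", 11), "ACAG")       # run of 11 Ts
  ),
  stringsAsFactors = FALSE
)

# Keep the rows whose sequence contains no run of 11 or more Ts
short_runs <- df[!grepl("T{11}", df$seq), , drop = FALSE]
short_runs$id
```

For the data frame in the question, where the sequences sit in column 17, the equivalent call would presumably be `df[!grepl("T{11}", df[[17]]), ]`.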
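Edit to cover the case where you are looking for 11-20 consecutive Ts
If you want to select the rows with between 11 and 20 Ts, you can use a negative lookbehind and a negative lookahead to search for a stretch of 11 to 20 Ts that is neither preceded nor followed by another T. Lookarounds are only available in Perl-compatible mode, which is why the call sets the perl argument.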
## Second example vector:
v2 = c("TTTTTTTTTTACAGATAT","TTTACACAC","TTTTTTTTTTTTTACAGAT","TTTTTTTTTTTACAG","ACTTTTTTTTTTTTTTTTTTTTTGCGCA")
v2[grepl("(?<!T)T{11,20}(?!T)",v2,perl=TRUE)]
[1] "TTTTTTTTTTTTTACAGAT" "TTTTTTTTTTTACAG"
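An alternative to writing one regular expression per range is to measure the longest run of Ts in each string and then filter on that number. This is only a sketch: `longest_t_run` is a hypothetical helper name, not a base R function, and the example strings are made up. Note that it is not strictly equivalent to the lookaround pattern, which also matches a string that has one run of 11-20 Ts alongside a second, longer run.

```r
# Length of the longest run of consecutive Ts in each string (0 if there is none).
# `longest_t_run` is a hypothetical helper name, not part of base R.
longest_t_run <- function(x) {
  runs <- regmatches(x, gregexpr("T+", x))  # all runs of Ts, one list entry per string
  vapply(runs, function(m) if (length(m)) max(nchar(m)) else 0L, integer(1))
}

# Illustrative sequences built with strrep() so the run lengths are explicit
s <- c(
  paste0(strrep("T", 10), "ACAGATAT"),     # longest run: 10
  "TTTACACAC",                             # longest run: 3
  paste0(strrep("T", 13), "ACAGAT"),       # longest run: 13
  paste0("AC", strrep("T", 21), "GCGCA"),  # longest run: 21
  "ACGCG"                                  # no T at all
)

n <- longest_t_run(s)

# Keep the strings whose longest run of Ts is between 11 and 20
s[n >= 11 & n <= 20]
```

Because the run length is computed once, the bounds can be changed (for example `n <= 10` for the original 0-10 case) without touching the pattern.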