I have a .fastq file formatted as follows:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 (name)
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG (sequence)
+
GGACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG (quality)
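The format is the same for every sequence (4 repeating lines). What I want to do is search the second line for a specific regular-expression pattern within a window of n = 35 characters, cut it out if found, and append it to the end of the preceding line.
So far, Dr. Norton has shared a nice script that searches for the regular expression, extracts it, and reports it in the read header (the first line). Unfortunately, because the indices (RSTART in particular) come out wrong, I cannot extract the string when it is located at the end of the line.
The code below does the job when searching for FtgtRegexp ([A-Z]{5}ACA[A-Z]{5}ACA[A-Z]{5}):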
BEGIN {
    FtgtRegexp = "[A-Z]{5}ACA[A-Z]{5}ACA[A-Z]{5}"
    winLgth = 35
    numLines = 4
}
{
    # buffer the four lines of the current record
    lineNr = ( (NR-1) % numLines ) + 1
    rec[lineNr] = $0
}
lineNr == numLines {
    # search only the first winLgth characters of the sequence line
    if ( match(substr(rec[2],1,winLgth),FtgtRegexp) ) {
        rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
        rec[2] = substr(rec[2],RSTART+RLENGTH)
        rec[4] = substr(rec[4],RSTART+RLENGTH)
    }
    for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
        print rec[lineNr]
    }
}
Input:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG
+
GGACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG
Output:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 CATCTACATATTCACATATAG
ACATGAAACACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG
+
GGGFGGGGGGGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG
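If I modify the script slightly to extract a second regular expression, RtgtRegexp, located at the end of the second line, it gives me the wrong output, because the RSTART reported for the match is wrong: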
BEGIN {
    FtgtRegexp = "[A-Z]{5}ACA[A-Z]{5}ACA[A-Z]{5}"
    RtgtRegexp = "[A-Z]{5}TGT[A-Z]{5}TGT[A-Z]{5}"
    winLgth = 35
    numLines = 4
}
{
    # buffer the four lines of the current record
    lineNr = ( (NR-1) % numLines ) + 1
    rec[lineNr] = $0
}
lineNr == numLines {
    # search only the last winLgth characters of the sequence line
    if ( match(substr(rec[2],(length(rec[2])-winLgth+1),winLgth),RtgtRegexp) ) {
        rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
        rec[2] = substr(rec[2],RSTART+RLENGTH)
        rec[4] = substr(rec[4],RSTART+RLENGTH)
    }
    for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
        print rec[lineNr]
    }
}
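A likely explanation for the index problem is that match() sets RSTART relative to the string it is given (here the substr() window), not relative to rec[2]. The following is a minimal sketch of that behaviour only; the string `s`, the 5-character window `win`, and the function name `rstart_demo` are made up for illustration and are not part of the script above.

```shell
rstart_demo() {
    awk 'BEGIN {
        s = "xxxxxxxxxxABCDE"        # hypothetical 15-character line
        win = 5                      # hypothetical window length
        offset = length(s) - win     # characters that precede the window
        if ( match(substr(s, offset + 1, win), /BCD/) ) {
            # RSTART counts from the start of the window,
            # so the offset has to be added back to index into s
            print "RSTART=" RSTART, "absolute=" offset + RSTART
        }
    }'
}
rstart_demo    # prints: RSTART=2 absolute=12
```

When the window starts at character 1, the offset is 0 and RSTART can be used directly, which would explain why the first script works.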
Input:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG
+
GGACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG
Desired output:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 CAGTATGTAGGACTGTAACAT
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCT
+
GGACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGF
Actual output:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 ATTCACATATAGACATGAAAC
ACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG
+
GGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG
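One possible way to get the desired output is sketched below, under the assumption that the cause is RSTART being relative to the window rather than to the whole line. The names `trim_tail`, `offset`, `start`, and `any5` are introduced here for illustration. The pattern is spelled out instead of using {5} only so that the sketch also runs on awk versions without interval-expression support; it is meant to be equivalent to RtgtRegexp.

```shell
# Sketch: cut a pattern found in the last winLgth characters of each read.
trim_tail() {
    awk '
    BEGIN {
        any5       = "[A-Z][A-Z][A-Z][A-Z][A-Z]"    # same as [A-Z]{5}
        RtgtRegexp = any5 "TGT" any5 "TGT" any5
        winLgth    = 35
        numLines   = 4
    }
    {
        lineNr = ( (NR-1) % numLines ) + 1
        rec[lineNr] = $0
    }
    lineNr == numLines {
        # characters before the window (0 if the read is shorter than winLgth)
        offset = length(rec[2]) - winLgth
        if ( offset < 0 ) offset = 0
        if ( match(substr(rec[2], offset + 1), RtgtRegexp) ) {
            start  = offset + RSTART                 # position within rec[2]
            rec[1] = rec[1] " " substr(rec[2], start, RLENGTH)
            rec[2] = substr(rec[2], 1, start - 1)    # keep what precedes the match
            rec[4] = substr(rec[4], 1, start - 1)
        }
        for ( lineNr = 1; lineNr <= numLines; lineNr++ ) {
            print rec[lineNr]
        }
    }' "$@"
}
```

It could be called as `trim_tail reads.fastq > trimmed.fastq` (hypothetical file names), or fed from a pipe. For the sample read above it appends CAGTATGTAGGACTGTAACAT to the header and keeps the first 48 characters of the sequence and quality lines. Anything after the match (the trailing AG here) is dropped, as in the desired output.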