我有一个看起来像这样的RNA-seq数据:
@J00157:85:HNNJLBBXX:5:1101:2869:15047 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCCTGGGGGACTGGACGCTGCTATTGATTCACGAGGCGCTCAGATCGGAAGAGCACAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJFJJJFJFJJJJJJJJJJJJJJJJ
--
@J00157:85:HNNJLBBXX:5:1101:12550:15574 1:N:0:ATTACTCG+TATAGCCT
GCTCTTCCGATCTGCTATTGATGACTGTCCTCTGTTCTTTCTTTCACAGTAGACGAGGACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTC
+
AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
--
如果我们将@之后的所有内容视为一个部分,您只能看到第二行是真实的排序信息,1,3,4,5是逻辑/质量信息。
目标 提取在每行的前N(N = 35)个字符中包含“GCTGCA”的序列(第二行信息),以及同时输出周围的线(前面1行,匹配线后3行)。
答案示例是
@J00157:85:HNNJLBBXX:5:1101:2869:15047 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCCTGGGGGACTGGACGCTGCTATTGATTCACGAGGCGCTCAGATCGGAAGAGCACAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJFJJJFJFJJJJJJJJJJJJJJJJ
--
我试过的是
awk 'substr($0, 1, 35) ~ "GCTGCA"' filename.fastq > newfile.fastq
grep -B 1 -A 2 -E GCTGCA filename.fastq > newfile.fastq
awk '{a[++i]=$0;}{substr(a[++i], 1, 35) ~ "GCTGCA"}{for(j=NR-1;j<=NR+2;j++)print a[j];}' filename.fastq > newfile.fastq
第一个不能输出周围的线条。第二个不能限制每行的前35个字母中的模式匹配。第三行应该可以工作,但它给了我有线输出(显然这是不正确的):
@J00157:85:HNNJLBBXX:5:1101:14235:1367 1:N:0:ATTACTCG+TATAGCCT
@J00157:85:HNNJLBBXX:5:1101:14235:1367 1:N:0:ATTACTCG+TATAGCCT
TCTNCTCTTCCGATCTACCCCACACACCCCCGCCGCCGCCGCCGCCGCCGCCCTCCGACGCACACCACACGCGCGCGCGCGCGCGCCGCCCCCGCCGCTCC
TCTNCTCTTCCGATCTACCCCACACACCCCCGCCGCCGCCGCCGCCGCCGCCCTCCGACGCACACCACACGCGCGCGCGCGCGCGCCGCCCCCGCCGCTCC
+
+
AAF#FJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFJJAJJJJJFJJJJ7JJ
AAF#FJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFJJAJJJJJFJJJJ7JJ
--
--
答案 0 :(得分:2)
gawk
多字符RS支持。
awk -v RS='\n--' -F'\n' 'substr($2,0,35)~"GCTGCA"{print $0 RS}' file
使用记录分隔符定义记录。
答案 1 :(得分:1)
awk
使用getline
:
search.awk :
substr($0,0,35)~"GCTGCA" {
print p # Print the previous line ...
print # ... , current line ...
for(i=0;i<=2;i++) { # ... and the 3 lines following it
getline
print
}
}
# Store the previous line
{ p = $0 }
这样称呼:
awk -f search.awk input_file
或没有正则表达式和参数:
search.awk
index(substr($0,0,35), search) {
print l
print
for(i=0;i<=2;i++) {
getline
print
}
}
{ l = $0 }
称之为
awk -v search="GCTGCA" -f search.awk input_file