I have a large file that looks like this small example:
chr1 HAVANA transcript 69091 70008 . + . gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";
chr1 HAVANA exon 69091 70008 . + . gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; exon_number 1; exon_id "ENSE00002319515.1"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";
chr1 HAVANA CDS 69091 70005 . + 0 gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; exon_number 1; exon_id "ENSE00002319515.1"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";
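Each line starts with "chr". I want to create a new file containing only the lines whose third column is "CDS". How can I do a conditional grep? I used the following command: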
grep -i CDS infile.txt > outfile
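But this returns every line that contains CDS, regardless of which column it appears in. Do you know how to fix this?
I would like to get this from the small example: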
chr1 HAVANA CDS 69091 70005 . + 0 gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; exon_number 1; exon_id "ENSE00002319515.1"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";
Answer 0 (score: 1)
The clean solution is to check the third column explicitly with awk:
awk '$3 == "CDS"' infile.txt
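To produce the new file the question asks for, the awk output can simply be redirected. The following is a minimal sketch, not a definitive recipe: `demo.txt` and `demo_out.txt` are hypothetical file names, and the three rows are abbreviated stand-ins for the real records. It relies on awk's default field splitting, which treats any run of spaces or tabs as a separator.

```shell
# Build a tiny stand-in file (hypothetical, abbreviated rows) to try the filter on.
printf '%s\n' \
  'chr1 HAVANA transcript 69091 70008 . + . tag "CCDS";' \
  'chr1 HAVANA exon 69091 70008 . + . tag "CCDS";' \
  'chr1 HAVANA CDS 69091 70005 . + 0 tag "CCDS";' > demo.txt

# Keep only rows whose third whitespace-separated field is exactly CDS
# and write them to a new file.
awk '$3 == "CDS"' demo.txt > demo_out.txt

cat demo_out.txt
```

Because the comparison is an exact string match on field 3, the `"CCDS"` tags later in each row are ignored.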
In your limited sample, it looks like every other CDS match on the remaining lines is part of a longer word (such as "CCDS"), so
grep -w 'CDS' infile.txt
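would also work, because it requires the match to be a whole word. That conclusion rests only on the limited sample you showed, though.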
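A small sketch of where the whole-word approach could go wrong. The rows below are hypothetical and abbreviated; in particular, the `note "CDS"` attribute is invented for illustration and does not come from the sample. It stands for any record that mentions CDS as a separate word outside column 3.

```shell
# Hypothetical, abbreviated rows; the second one has a standalone "CDS" in an attribute.
printf '%s\n' \
  'chr1 HAVANA CDS 69091 70005 . + 0 tag "CCDS";' \
  'chr1 HAVANA exon 69091 70008 . + . note "CDS";' \
  'chr1 HAVANA exon 69091 70008 . + . tag "CCDS";' > word_demo.txt

# Whole-word match: counts any line with a standalone CDS, in any column.
grep -c -w 'CDS' word_demo.txt

# Column check: counts only lines whose third field is CDS.
awk '$3 == "CDS"' word_demo.txt | wc -l
```

The grep count is 2 and the awk count is 1, since the exon row with the standalone word passes the `-w` test but not the column test.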
A grep solution that checks the third column could look like this (\s, \S and \> require GNU grep):
grep -E '^(\S+\s+){2}CDS\>' infile.txt
or, in a POSIX-compliant form:
grep -E '^([^[:blank:]]+[[:blank:]]+){2}CDS([[:blank:]]|$)' infile.txt
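One way to gain some confidence in the column-anchored pattern is to compare its output with the awk filter on a few tricky rows. This is only a sketch under stated assumptions: the file names are hypothetical, the rows are abbreviated, and the `CDSX` feature type and the row with CDS in column 2 are invented edge cases.

```shell
# Hypothetical rows: a real CDS row, a longer word in column 3, and CDS in column 2.
printf '%s\n' \
  'chr1 HAVANA CDS 69091 70005 . + 0 tag "CCDS";' \
  'chr1 HAVANA CDSX 69091 70005 . + 0 tag "basic";' \
  'chr1 CDS exon 69091 70008 . + . tag "basic";' > col_demo.txt

# POSIX ERE: skip exactly two fields, then require CDS followed by a blank or end of line.
grep -E '^([^[:blank:]]+[[:blank:]]+){2}CDS([[:blank:]]|$)' col_demo.txt > by_grep.txt

# Reference result from the explicit column comparison.
awk '$3 == "CDS"' col_demo.txt > by_awk.txt

# cmp -s exits with status 0 when the two files are identical.
if cmp -s by_grep.txt by_awk.txt; then echo "same"; else echo "different"; fi
```

On these rows both commands keep only the first line. They can still differ on input with leading whitespace, which awk's default splitting skips but the `^` anchor in the grep pattern does not.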