Delete lines whose first field contains certain strings in Linux

Date: 2017-06-27 18:07:09

Tags: bash

I have a large file with 672,343 lines, for example:

$ wc -l $GTF
672343 /data1/Annotation/iGenome/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf

$ head $GTF
chr1    unknown exon    3214482 3216968 .       -       .       gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1    unknown stop_codon      3216022 3216024 .       -       .       gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1    unknown CDS     3216025 3216968 .       -       2       gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1    unknown CDS     3421702 3421901 .       -       1       gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1    unknown exon    3421702 3421901 .       -       .       gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1    unknown CDS     3670552 3671348 .       -       0       gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1    unknown exon    3670552 3671498 .       -       .       gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1    unknown start_codon     3671346 3671348 .       -       .       gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1    unknown exon    4290846 4293012 .       -       .       gene_id "Rp1"; gene_name "Rp1"; p_id "P17361"; transcript_id "NM_001195662"; tss_id "TSS6138";
chr1    unknown stop_codon      4292981 4292983 .       -       .       gene_id "Rp1"; gene_name "Rp1"; p_id "P17361"; transcript_id "NM_001195662"; tss_id "TSS6138";

The unique values in the first field are:

$ cat $GTF | cut -f 1 | sort | uniq
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr1_GL456211_random
chr1_GL456221_random
chr2
chr3
chr4
chr4_GL456216_random
chr4_GL456350_random
chr4_JH584292_random
chr4_JH584293_random
chr4_JH584294_random
chr5
chr5_GL456354_random
chr5_JH584296_random
chr5_JH584297_random
chr5_JH584298_random
chr5_JH584299_random
chr6
chr7
chr7_GL456219_random
chr8
chr9
chrUn_JH584304
chrX
chrX_GL456233_random
chrY

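What I want to achieve is to delete the lines whose first field contains "_", and write the result to another file in the same format.

3 Answers:

Answer 0: (score: 4)

awk to the rescue!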

awk '$1~/_/{print > "underscores"; next} 1' file

This prints records with "_" in the first field to the file "underscores"; the rest go to stdout (which you can redirect to an output file as usual).
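As a minimal sketch of how that looks end to end, the snippet below builds a three-line sample and routes both streams to files. The file names (sample.gtf, filtered.gtf, underscores.gtf) and the temporary directory are assumptions for illustration, and -F'\t' is added on the assumption that the input is tab-delimited, as GTF files normally are.

```shell
# Hypothetical sample data: three tab-delimited records.
workdir=$(mktemp -d)
printf 'chr1\tunknown\texon\t3214482\t3216968\n'          >  "$workdir/sample.gtf"
printf 'chr1_GL456211_random\tunknown\texon\t100\t200\n'  >> "$workdir/sample.gtf"
printf 'chrX\tunknown\tstop_codon\t300\t302\n'            >> "$workdir/sample.gtf"

# Records whose first field contains "_" go to the "rejected" file;
# every other record is printed to stdout and redirected to filtered.gtf.
awk -F'\t' -v rejected="$workdir/underscores.gtf" \
    '$1 ~ /_/ { print > rejected; next } 1' \
    "$workdir/sample.gtf" > "$workdir/filtered.gtf"

# Show the first field of the records that were kept.
cut -f 1 "$workdir/filtered.gtf"
```

The chrX record is kept even though its third field is stop_codon, because only the first field is tested.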

Answer 1: (score: 2)

Try this ([[:space:]] matches the tab that separates GTF fields):

grep -E '^[^_[:space:]]+[[:space:]]' file.txt
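The same filter can also be phrased as an exclusion: drop every line whose leading run of non-whitespace characters contains an underscore. The sketch below is one way to do that with grep -v; the sample data and file names are hypothetical, and the pattern assumes the first field never starts with whitespace.

```shell
# Hypothetical sample data: three tab-delimited records.
workdir=$(mktemp -d)
printf 'chr1\tunknown\tstop_codon\t3216022\t3216024\n'    >  "$workdir/sample.gtf"
printf 'chr4_JH584292_random\tunknown\texon\t100\t200\n'  >> "$workdir/sample.gtf"
printf 'chrY\tunknown\texon\t300\t400\n'                  >> "$workdir/sample.gtf"

# -v inverts the match: lines whose first field contains "_" are removed.
# [^[:space:]]* can only consume the first field, so underscores in later
# fields (stop_codon, gene_id, ...) do not cause a line to be dropped.
grep -v -E '^[^[:space:]]*_' "$workdir/sample.gtf" > "$workdir/no_underscore.gtf"

# List the distinct first-field values that remain.
cut -f 1 "$workdir/no_underscore.gtf" | sort -u
```

Checking the result with cut -f 1 | sort -u, as in the question, is a quick way to confirm that no underscore-containing names survive.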

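Answer 2: (score: 0)

As much as I like awk, and it is great, grep will always be faster on large files. Especially if the file has 900,000 lines... IMHO grep is just the better choice.

If you have large files, you should split them up and run multiple processes in a loop by utilizing "&"; this is called forking.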
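A rough sketch of that idea follows, assuming the file can be cut on line boundaries with split and that output order should match input order. The chunk size, file names, and grep pattern are all assumptions for illustration; whether this actually beats a single grep depends on the machine and the file size and has not been measured here.

```shell
# Hypothetical sample data: six tab-delimited records, three with "_".
workdir=$(mktemp -d)
for i in 1 2 3; do
    printf 'chr%s\tunknown\texon\n' "$i"
    printf 'chr%s_GL_random\tunknown\texon\n' "$i"
done > "$workdir/sample.gtf"

# Cut the file into chunks of 2 lines each (chunk.aa, chunk.ab, ...).
split -l 2 "$workdir/sample.gtf" "$workdir/chunk."

# Start one background grep per chunk with "&", then wait for all of them.
for chunk in "$workdir"/chunk.*; do
    grep -v -E '^[^[:space:]]*_' "$chunk" > "$chunk.out" &
done
wait

# The glob expands in sorted order, so the original line order is preserved.
cat "$workdir"/chunk.*.out > "$workdir/filtered.gtf"
wc -l < "$workdir/filtered.gtf"
```

For a real file the chunk size would be far larger, for example the line count divided by the number of CPU cores.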