I have a large file with 672,343 lines, for example:
$ wc -l $GTF
672343 /data1/Annotation/iGenome/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf
$ head $GTF
chr1 unknown exon 3214482 3216968 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1 unknown stop_codon 3216022 3216024 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1 unknown CDS 3216025 3216968 . - 2 gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1 unknown CDS 3421702 3421901 . - 1 gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1 unknown exon 3421702 3421901 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1 unknown CDS 3670552 3671348 . - 0 gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1 unknown exon 3670552 3671498 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1 unknown start_codon 3671346 3671348 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1 unknown exon 4290846 4293012 . - . gene_id "Rp1"; gene_name "Rp1"; p_id "P17361"; transcript_id "NM_001195662"; tss_id "TSS6138";
chr1 unknown stop_codon 4292981 4292983 . - . gene_id "Rp1"; gene_name "Rp1"; p_id "P17361"; transcript_id "NM_001195662"; tss_id "TSS6138";
The unique values in the first field are:
$ cat $GTF | cut -f 1 | sort | uniq
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr1_GL456211_random
chr1_GL456221_random
chr2
chr3
chr4
chr4_GL456216_random
chr4_GL456350_random
chr4_JH584292_random
chr4_JH584293_random
chr4_JH584294_random
chr5
chr5_GL456354_random
chr5_JH584296_random
chr5_JH584297_random
chr5_JH584298_random
chr5_JH584299_random
chr6
chr7
chr7_GL456219_random
chr8
chr9
chrUn_JH584304
chrX
chrX_GL456233_random
chrY
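What I want to do is remove every line whose first field contains "_" and write the remaining lines to another file in the same format.
Answer 0 (score: 4)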
awk to the rescue!
awk '$1~/_/{print > "underscores"; next} 1' file
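This writes the records with "_" in the first field to a file named "underscores"; all other records are printed to stdout, which you can redirect to an output file as usual.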
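If only the filtered annotation is needed, the same test can be negated. The following is a minimal sketch, assuming a tab-delimited GTF; the sample records and the file names sample.gtf and filtered.gtf are made up for illustration.

```shell
# Build a tiny tab-delimited sample (hypothetical data modeled on the question).
tmpdir=$(mktemp -d)
printf 'chr1\tunknown\texon\t3214482\t3216968\n'          >  "$tmpdir/sample.gtf"
printf 'chr1_GL456211_random\tunknown\texon\t100\t200\n' >> "$tmpdir/sample.gtf"
printf 'chrX\tunknown\tstop_codon\t300\t302\n'           >> "$tmpdir/sample.gtf"

# Keep only records whose first tab-separated field has no underscore.
# -F'\t' splits on tabs; awk prints each kept record unchanged,
# so the output stays in the input format.
awk -F'\t' '$1 !~ /_/' "$tmpdir/sample.gtf" > "$tmpdir/filtered.gtf"

cat "$tmpdir/filtered.gtf"
```

Because the test looks only at $1, an underscore elsewhere in the record (such as "stop_codon" or "gene_id") does not cause a line to be dropped.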
Answer 1 (score: 2)
Try this:
grep -E '^[^_[:space:]]+[[:space:]]' file.txt
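The same filter can also be written the other way round, by dropping lines that have an underscore before the first whitespace. This is only a sketch under the assumption that the fields are separated by tabs or spaces; the sample records and the names in.gtf and out.gtf are hypothetical.

```shell
# Small tab-delimited sample (hypothetical records).
tmpdir=$(mktemp -d)
printf 'chr1\tunknown\tstop_codon\t1\t3\tgene_id "Xkr4";\n'       >  "$tmpdir/in.gtf"
printf 'chr4_JH584292_random\tunknown\texon\t5\t9\tgene_id "A";\n' >> "$tmpdir/in.gtf"
printf 'chrY\tunknown\texon\t7\t8\tgene_id "B";\n'                >> "$tmpdir/in.gtf"

# -v drops every line in which an underscore appears before the first
# whitespace character, i.e. inside the first field.
grep -v -E '^[^[:space:]]*_' "$tmpdir/in.gtf" > "$tmpdir/out.gtf"

# List the remaining first-field values to confirm the result.
cut -f 1 "$tmpdir/out.gtf" | sort -u
```

Rerunning the cut/sort check from the question on the output file is a quick way to verify that no "_random" or "chrUn_" contigs remain.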
Answer 2 (score: 0)
As much as I like awk, and it is great, grep is usually faster on large files. Especially for a file with 900,000 lines, IMHO grep is simply the better choice.
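If you have large files, you can split them into chunks and process the chunks in a loop, using "&" to run each process in the background so that several run at once. This is called forking.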
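The split-and-background idea might look roughly like the sketch below. The chunk size, the chunk_ prefix, and the sample data are assumptions made for illustration, and whether this actually beats a single grep depends on the machine and the file size.

```shell
tmpdir=$(mktemp -d)

# Hypothetical sample: 6 tab-delimited records, two on a "_random" contig.
for i in 1 2 3 4 5 6; do
  if [ $((i % 3)) -eq 0 ]; then chrom="chr1_GL456211_random"; else chrom="chr$i"; fi
  printf '%s\tunknown\texon\t%s\t%s\n' "$chrom" "$i" "$((i + 10))"
done > "$tmpdir/in.gtf"

# Split into chunks of 2 lines each (chunk_aa, chunk_ab, ...).
split -l 2 "$tmpdir/in.gtf" "$tmpdir/chunk_"

# Start one background grep per chunk with "&", then wait for all of them.
for chunk in "$tmpdir"/chunk_??; do
  grep -v -E '^[^[:space:]]*_' "$chunk" > "$chunk.out" &
done
wait

# Chunk names sort in their original order, so cat restores the line order.
cat "$tmpdir"/chunk_??.out > "$tmpdir/out.gtf"
cat "$tmpdir/out.gtf"
```

The wait call matters: without it, the final cat could run before the background jobs have finished writing their output.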