Merging tab-delimited files based on the first 8 columns

Time: 2017-09-25 11:54:01

Tags: unix text awk merge

I have a tab-delimited file (let's call it file1) that looks like this:

NC_027300.1 Gnomon  exon    5501    5691    .   -   .   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  exon    16966   17019   .   -   .   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  exon    23978   24241   .   -   .   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  exon    43486   43714   .   -   .   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  exon    61647   62139   .   -   .   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  CDS 5501    5691    .   -   2   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  CDS 16966   17019   .   -   2   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  CDS 23978   24241   .   -   2   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  CDS 43486   43633   .   -   0   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  exon    160437  160638  .   -   .   gene_id "2"; transcript_id "2.1";
NC_027300.1 Gnomon  exon    160913  161019  .   -   .   gene_id "2"; transcript_id "2.1";

and a larger tab-delimited file (file2) that looks like this:

NC_027300.1 Gnomon  gene    5501    62139   .   -   .   ID=gene0;Dbxref=GeneID:106560212;Name=LOC106560212;gbkey=Gene;gene=LOC106560212;gene_biotype=protein_coding
NC_027300.1 Gnomon  mRNA    5501    62139   .   -   .   ID=rna0;Parent=gene0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;Name=XM_014160784.1;gbkey=mRNA;gene=LOC106560212;model_evidence=Supporting evidence includes similarity to: 99%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 8 samples with support for all annotated introns;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon  exon    61647   62139   .   -   .   ID=id1;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon  exon    43486   43714   .   -   .   ID=id2;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon  exon    23978   24241   .   -   .   ID=id3;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon  exon    16966   17019   .   -   .   ID=id4;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon  exon    5501    5691    .   -   .   ID=id5;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon  CDS 43486   43633   .   -   0   ID=cds0;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1
NC_027300.1 Gnomon  CDS 23978   24241   .   -   2   ID=cds0;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1
NC_027300.1 Gnomon  CDS 16966   17019   .   -   2   ID=cds0;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1

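I want to create a new file containing only the lines of file1 that are also present in file2, matched on the first 8 columns, with all 9 columns of file1 followed by column 9 of file2 as column 10. Like this: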

NC_027300.1 Gnomon  exon    5501    5691    .   -   .   gene_id "1"; transcript_id "1.1"; ID=id5;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1

I have been trying to follow this example, and this is what (with my very limited knowledge) I came up with:

awk 'NR==FNR{a[$1,$2,$3,$4,$5,$6,$7,$8]=$10;next} ($1,$2,$3,$4,$5,$6,$7,$8) in a{print $0, a[$$1,$2,$3,$4,$5,$6,$7,$8]}' file1 file2 > newfile

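Can someone tell me whether I am anywhere close, and help me if this is wrong? My files are 1M+ lines and the command is running now, but I am worried it may take a while before I can see whether it worked! Thanks in advance.

2 Answers:

Answer 0: (score: 1)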

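You are on the right track; it looks like you only need a small correction.

Change

a[$$1,$2,$3,$4,$5,$6,$7,$8]
  ^
 Here

to

a[$1,$2,$3,$4,$5,$6,$7,$8]

In awk, `$$1` means "the field whose number is the value of `$1`". Here `$1` is non-numeric, so it evaluates to 0 and `$$1` becomes `$0`, the whole line; the lookup key then never matches a stored key and an empty string is printed.

With the correction, if the index key built from the first 8 fields of file2 exists in array a, which was created from the first 8 fields of file1, the command prints the file2 line followed by the 10th field of file1 stored in array a.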
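The effect of the extra `$` can be checked on a single line before rerunning the whole job. This is a minimal sketch, assuming a two-field, tab-separated sample line invented for illustration; the variable names are hypothetical.

```shell
# One sample line (tab-separated), cut down to two fields for illustration.
line=$(printf 'NC_027300.1\tGnomon')

# $1 is non-numeric, so it converts to 0 and $$1 is evaluated as $0.
dollar_dollar=$(printf '%s\n' "$line" | awk '{ print $$1 }')
# Plain $1 returns just the first field.
dollar_one=$(printf '%s\n' "$line" | awk '{ print $1 }')

echo "\$\$1 gives: $dollar_dollar"
echo "\$1 gives:  $dollar_one"
```

The first result is the entire input line, which is why the array lookup in the original command could not find the key that was stored.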

Answer 1: (score: 1)

Switch the order of the input files and tidy up:
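A minimal sketch of what reading file2 first could look like, not necessarily the exact command intended here. It assumes both files are truly tab-delimited, so the field separator is set to a tab and column 9 stays in one piece even though it contains spaces. The sample rows are shortened stand-ins for the data in the question, and the array name `attr` is only illustrative.

```shell
# Tiny tab-separated stand-ins for file1 and file2 (file2 attributes shortened).
printf 'NC_027300.1\tGnomon\texon\t5501\t5691\t.\t-\t.\tgene_id "1"; transcript_id "1.1";\n' >  file1
printf 'NC_027300.1\tGnomon\texon\t160437\t160638\t.\t-\t.\tgene_id "2"; transcript_id "2.1";\n' >> file1

printf 'NC_027300.1\tGnomon\tgene\t5501\t62139\t.\t-\t.\tID=gene0\n' >  file2
printf 'NC_027300.1\tGnomon\texon\t5501\t5691\t.\t-\t.\tID=id5;Parent=rna0\n' >> file2

awk 'BEGIN { FS = OFS = "\t" }
     # First file on the command line (file2): store column 9 under a key of columns 1-8.
     NR == FNR { attr[$1,$2,$3,$4,$5,$6,$7,$8] = $9; next }
     # Second file (file1): print the full line plus the stored file2 column as column 10.
     ($1,$2,$3,$4,$5,$6,$7,$8) in attr { print $0, attr[$1,$2,$3,$4,$5,$6,$7,$8] }
' file2 file1 > newfile

cat newfile
```

Two caveats apply to this sketch: file2 is held in memory, and if several file2 rows share the same first 8 columns only the last one is kept. Running it on a few lines like this is also a quick way to confirm the logic before starting on the 1M+ line files.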
