How do I subset a TSV file based on a pattern?

Time: 2017-05-05 09:29:10

Tags: awk sed grep

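I have two files. One is a tab-delimited file with multiple columns. The other is a list of gene names. I need to extract only those rows of file 1 that contain a gene listed in file 2. I tried the following command, but it extracts all rows: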

 awk 'NR==FNR{a[$0]=1;next} {for(i in a){if($10~i){print;break}}}' File2 file1

File1:

Input line ID Chrom Position Strand Ref. base(s) Alt. base(s) Sample ID HUGO symbol Sequence ontology Protein sequen
3 VAR113_NM-02_TUMOR_DNA chr1 11082255 + G T NM-02_TUMOR_DNA TARDBP MS K263N . PASS het 3 25
4 VAR114_NM-02_TUMOR_DNA chr1 15545868 + G T NM-02_TUMOR_DNA TMEM51 MS V131F . PASS het 3 13
6 VAR116_NM-02_TUMOR_DNA chr1 20676680 + C T NM-02_TUMOR_DNA VWA5B1 SY S970S . PASS het 4 34
7 rs149021429_NM-02_TUMOR_DNA chr1 21554495 + C A NM-02_TUMOR_DNA ECE1 SY S570S . PASS het 3
16 VAR126_NM-02_TUMOR_DNA chr1 39905109 + C T NM-02_TUMOR_DNA MACF1 SY V4069V . PASS het 4 17
21 VAR131_NM-02_TUMOR_DNA chr1 101387378 + G T NM-02_TUMOR_DNA SLC30A7 MS A275S . PASS het 4 45
24 VAR134_NM-02_TUMOR_DNA chr1 113256156 + C A NM-02_TUMOR_DNA PPM1J MS S135I . PASS het 3 9
25 rs201097299_NM-02_TUMOR_DNA chr1 145326106 + A T NM-02_TUMOR_DNA NBPF10 MS M1327L . PASS het 5
26 VAR136_NM-02_TUMOR_DNA chr1 149859281 + T C NM-02_TUMOR_DNA HIST2H2AB SY E62E . PASS het 11
27 VAR137_NM-02_TUMOR_DNA chr1 150529801 + C A NM-02_TUMOR_DNA ADAMTSL4 SY S679S . PASS het 3
28 rs376491237_NM-02_TUMOR_DNA chr1 150532649 + C A NM-02_TUMOR_DNA ADAMTSL4 SY R1068R . PASS het
34 VAR144_NM-02_TUMOR_DNA chr1 160389277 + T A NM-02_TUMOR_DNA VANGL2 SY L226L . PASS het 3 6
35 VAR145_NM-02_TUMOR_DNA chr1 161012389 + C A NM-02_TUMOR_DNA USF1 MS D44Y . PASS het 3 32
37 VAR147_NM-02_TUMOR_DNA chr1 200954042 + G T NM-02_TUMOR_DNA KIF21B MS R1250S . PASS het 3 21
41 rs191896925_NM-02_TUMOR_DNA chr1 207760805 + G T NM-02_TUMOR_DNA CR1 MS G1869W . PASS het 3
42 VAR152_NM-02_TUMOR_DNA chr1 208218427 + C A NM-02_TUMOR_DNA PLXNA2 SY G1208G . PASS het 3 13
43 VAR153_NM-02_TUMOR_DNA chr1 222715425 + A G NM-02_TUMOR_DNA HHIPL2 SY Y349Y . PASS het 10 41
44 VAR154_NM-02_TUMOR_DNA chr1 222715452 + T A NM-02_TUMOR_DNA HHIPL2 SY G340G . PASS het 5 46
45 VAR155_NM-02_TUMOR_DNA chr1 223568296 + G A NM-02_TUMOR_DNA C1orf65 SY G493G . PASS het 3 25
48 VAR158_NM-02_TUMOR_DNA chr2 8931258 + G A NM-02_TUMOR_DNA KIDINS220 MS P458L . PASS het 3 13
51 VAR161_NM-02_TUMOR_DNA chr2 37229656 + C A NM-02_TUMOR_DNA HEATR5B MS G1704C . PASS het 4 9
60 VAR170_NM-02_TUMOR_DNA chr2 84775506 + G T NM-02_TUMOR_DNA DNAH6 MS Q427H . PASS het 3 20
63 VAR173_NM-02_TUMOR_DNA chr2 86378563 + C A NM-02_TUMOR_DNA IMMT MS A420S . PASS het 6 29
64 VAR174_NM-02_TUMOR_DNA chr2 86716546 + G T NM-02_TUMOR_DNA KDM3A MS C1140F . PASS het 3 18
65 VAR175_NM-02_TUMOR_DNA chr2 96852612 + C A NM-02_TUMOR_DNA STARD7 SY L323L . PASS het 2 2
67 VAR177_NM-02_TUMOR_DNA chr2 121747740 + C A NM-02_TUMOR_DNA GLI2 MS P1417H . PASS het 2 2
71 rs199770435_NM-02_TUMOR_DNA chr2 130872871 + C T NM-02_TUMOR_DNA POTEF SY G184G . PASS het 8
72 rs199695856_NM-02_TUMOR_DNA chr2 132919171 + A G NM-02_TUMOR_DNA ANKRD30BL SY H36H . PASS het
73 rs111295191_NM-02_TUMOR_DNA chr2 132919192 + G A NM-02_TUMOR_DNA ANKRD30BL SY N29N . PASS het
76 VAR186_NM-02_TUMOR_DNA chr2 167084231 + T A NM-02_TUMOR_DNA SCN9A SY A1392A . PASS het 3 19
77 VAR187_NM-02_TUMOR_DNA chr2 168100115 + C G NM-02_TUMOR_DNA XIRP2 MS T738S . PASS het 9 49
78 VAR188_NM-02_TUMOR_DNA chr2 179343033 + G T NM-02_TUMOR_DNA FKBP7 MS A65D . PASS het 3 7
79 VAR189_NM-02_TUMOR_DNA chr2 179544108 + G C NM-02_TUMOR_DNA TTN MS P11234A . PASS het 3 17
82 VAR192_NM-02_TUMOR_DNA chr2 220074164 + G T NM-02_TUMOR_DNA ZFAND2B MS E92D . PASS het 2 2
83 VAR193_NM-02_TUMOR_DNA chr2 220420892 + C A NM-02_TUMOR_DNA OBSL1 MS G1487W . PASS het 3 9
84 rs191578275_NM-02_TUMOR_DNA chr2 233273263 + C A NM-02_TUMOR_DNA ALPPL2 MS P279Q . PASS het 3
86 VAR196_NM-02_TUMOR_DNA chr2 241815391 + G T NM-02_TUMOR_DNA AGXT SY L272L . PASS het 3 10
88 VAR198_NM-02_TUMOR_DNA chr3 9484995 + C T NM-02_TUMOR_DNA SETD5 SG R361* . PASS het 3 18
96 VAR206_NM-02_TUMOR_DNA chr3 49848502 + G T NM-02_TUMOR_DNA UBA7 MS P382H . PASS het 5 38
102 VAR212_NM-02_TUMOR_DNA chr3 58302669 + G T NM-02_TUMOR_DNA RPP14 MS L89F . PASS het 3 30
103 VAR213_NM-02_TUMOR_DNA chr3 63981750 + C A NM-02_TUMOR_DNA ATXN7 MS T751K . PASS het 3 13
104 rs146577101_NM-02_TUMOR_DNA chr3 97868656 + C T NM-02_TUMOR_DNA OR5H14 MS R143W . PASS het 4
107 rs58176285_NM-02_TUMOR_DNA chr3 123419183 + G A NM-02_TUMOR_DNA MYLK SY A1044A . PASS het 18
108 VAR218_NM-02_TUMOR_DNA chr3 123419189 + C T NM-02_TUMOR_DNA MYLK SY K1042K . PASS het 23 174
115 VAR225_NM-02_TUMOR_DNA chr3 183753779 + C A NM-02_TUMOR_DNA HTR3D MS P91T . PASS het 4 48

File2:

FBN1

HELZ

RALGPS2

DYNC1I2

NFE2L2

POSTN

INO80

I want the rows that contain these genes.

1 answer:

Answer 0: (score: 1)

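So, if I follow you correctly, you just want the rows of file1 whose gene in $9 is listed in file2. With MYLK added to the list, I got the output shown after the command.

Maybe: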

awk 'NR==FNR{A[$1];next}$9 in A' file2 file1

**empty line** (since `MYLK` was found after the line break, it is included)
107     rs58176285_NM-02_TUMOR_DNA      chr3    123419183       +       G       A       NM-02_TUMOR_DNA MYLK    SY      A1044A  .       PASS    het     18
108     VAR218_NM-02_TUMOR_DNA  chr3    123419189       +       C       T       NM-02_TUMOR_DNA MYLK    SY      K1042K  .       PASS    het     23      174
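
The one-liner uses a common two-file awk idiom. Spelled out with comments, it might look like the sketch below. The helper name `match_genes` and the miniature temporary inputs are made up for illustration; they are not the real file1/file2. The demo also shows one plausible source of the empty output line, assuming file2 contains blank lines as its listing suggests: a blank line in the list stores an empty key, and a blank row in file1 has an empty `$9`, so it matches.

```shell
#!/bin/sh
# match_genes LIST TABLE: the one-liner above, expanded (hypothetical helper name).
match_genes() {
    awk '
        # NR == FNR is true only while the first file (the gene list) is read.
        NR == FNR { A[$1]; next }   # store each gene name as an array key
        # Second file: print the row if its 9th field is one of the keys.
        $9 in A
    ' "$1" "$2"
}

# Miniature stand-ins for file2 and file1 (shortened, illustrative rows).
tmp=$(mktemp -d)
printf 'FBN1\n\nMYLK\n' > "$tmp/file2"   # the blank line stores the key ""
printf '88 VAR198 chr3 9484995 + C T S1 SETD5 SG\n\n107 rs58176285 chr3 123419183 + G A S1 MYLK SY\n' > "$tmp/file1"

# Prints an empty line (the blank row, whose $9 is "") and then the MYLK row.
match_genes "$tmp/file2" "$tmp/file1"
rm -rf "$tmp"
```

Because the lookup is an exact key test rather than a regular-expression match, a gene such as `TTN` would not accidentally match a longer symbol that contains it.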

To remove the empty line from the output:

awk 'NR==FNR{A[$1];next}$9 in A' file2 file1 | awk '!/^$/' 

107     rs58176285_NM-02_TUMOR_DNA      chr3    123419183       +       G       A       NM-02_TUMOR_DNA MYLK    SY      A1044A  .       PASS    het     18
108     VAR218_NM-02_TUMOR_DNA  chr3    123419189       +       C       T       NM-02_TUMOR_DNA MYLK    SY      K1042K  .       PASS    het     23      174
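
A single awk pass can give the same result without the second filter. The sketch below is one possible variant, not the only way to do it: `filter_rows` is a hypothetical helper name, and `-F'\t'` assumes the table really is tab-delimited. Skipping blank lines in the gene list also avoids a likely cause of the question's problem, assuming File2 contains blank lines as its listing suggests: an empty pattern in `$10 ~ i` matches every row. In the sample, the gene symbol also appears to be field 9 rather than 10.

```shell
#!/bin/sh
# filter_rows LIST TABLE (hypothetical helper name): one awk pass, no second filter.
filter_rows() {
    awk -F'\t' '
        NR == FNR { if (NF) genes[$1]; next }  # ignore blank lines in the list
        NF && ($9 in genes)                    # ignore blank rows in the table
    ' "$1" "$2"
}

# Usage, with the file names from the question:
#   filter_rows file2 file1
```

If the columns are separated by arbitrary whitespace instead of tabs, dropping `-F'\t'` restores awk's default field splitting.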