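I have two files. One is a tab-delimited file with multiple columns. The other is a list of gene names. I need to extract only those rows of file 1 whose gene is listed in file 2. I tried the following command, but it extracts all rows: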
awk 'NR==FNR{a[$0]=1;next} {for(i in a){if($10~i){print;break}}}' File2 file1
File1:
Input line ID Chrom Position Strand Ref. base(s) Alt. base(s) Sample ID HUGO symbol Sequence ontology Protein sequen
3 VAR113_NM-02_TUMOR_DNA chr1 11082255 + G T NM-02_TUMOR_DNA TARDBP MS K263N . PASS het 3 25
4 VAR114_NM-02_TUMOR_DNA chr1 15545868 + G T NM-02_TUMOR_DNA TMEM51 MS V131F . PASS het 3 13
6 VAR116_NM-02_TUMOR_DNA chr1 20676680 + C T NM-02_TUMOR_DNA VWA5B1 SY S970S . PASS het 4 34
7 rs149021429_NM-02_TUMOR_DNA chr1 21554495 + C A NM-02_TUMOR_DNA ECE1 SY S570S . PASS het 3
16 VAR126_NM-02_TUMOR_DNA chr1 39905109 + C T NM-02_TUMOR_DNA MACF1 SY V4069V . PASS het 4 17
21 VAR131_NM-02_TUMOR_DNA chr1 101387378 + G T NM-02_TUMOR_DNA SLC30A7 MS A275S . PASS het 4 45
24 VAR134_NM-02_TUMOR_DNA chr1 113256156 + C A NM-02_TUMOR_DNA PPM1J MS S135I . PASS het 3 9
25 rs201097299_NM-02_TUMOR_DNA chr1 145326106 + A T NM-02_TUMOR_DNA NBPF10 MS M1327L . PASS het 5
26 VAR136_NM-02_TUMOR_DNA chr1 149859281 + T C NM-02_TUMOR_DNA HIST2H2AB SY E62E . PASS het 11
27 VAR137_NM-02_TUMOR_DNA chr1 150529801 + C A NM-02_TUMOR_DNA ADAMTSL4 SY S679S . PASS het 3
28 rs376491237_NM-02_TUMOR_DNA chr1 150532649 + C A NM-02_TUMOR_DNA ADAMTSL4 SY R1068R . PASS het
34 VAR144_NM-02_TUMOR_DNA chr1 160389277 + T A NM-02_TUMOR_DNA VANGL2 SY L226L . PASS het 3 6
35 VAR145_NM-02_TUMOR_DNA chr1 161012389 + C A NM-02_TUMOR_DNA USF1 MS D44Y . PASS het 3 32
37 VAR147_NM-02_TUMOR_DNA chr1 200954042 + G T NM-02_TUMOR_DNA KIF21B MS R1250S . PASS het 3 21
41 rs191896925_NM-02_TUMOR_DNA chr1 207760805 + G T NM-02_TUMOR_DNA CR1 MS G1869W . PASS het 3
42 VAR152_NM-02_TUMOR_DNA chr1 208218427 + C A NM-02_TUMOR_DNA PLXNA2 SY G1208G . PASS het 3 13
43 VAR153_NM-02_TUMOR_DNA chr1 222715425 + A G NM-02_TUMOR_DNA HHIPL2 SY Y349Y . PASS het 10 41
44 VAR154_NM-02_TUMOR_DNA chr1 222715452 + T A NM-02_TUMOR_DNA HHIPL2 SY G340G . PASS het 5 46
45 VAR155_NM-02_TUMOR_DNA chr1 223568296 + G A NM-02_TUMOR_DNA C1orf65 SY G493G . PASS het 3 25
48 VAR158_NM-02_TUMOR_DNA chr2 8931258 + G A NM-02_TUMOR_DNA KIDINS220 MS P458L . PASS het 3 13
51 VAR161_NM-02_TUMOR_DNA chr2 37229656 + C A NM-02_TUMOR_DNA HEATR5B MS G1704C . PASS het 4 9
60 VAR170_NM-02_TUMOR_DNA chr2 84775506 + G T NM-02_TUMOR_DNA DNAH6 MS Q427H . PASS het 3 20
63 VAR173_NM-02_TUMOR_DNA chr2 86378563 + C A NM-02_TUMOR_DNA IMMT MS A420S . PASS het 6 29
64 VAR174_NM-02_TUMOR_DNA chr2 86716546 + G T NM-02_TUMOR_DNA KDM3A MS C1140F . PASS het 3 18
65 VAR175_NM-02_TUMOR_DNA chr2 96852612 + C A NM-02_TUMOR_DNA STARD7 SY L323L . PASS het 2 2
67 VAR177_NM-02_TUMOR_DNA chr2 121747740 + C A NM-02_TUMOR_DNA GLI2 MS P1417H . PASS het 2 2
71 rs199770435_NM-02_TUMOR_DNA chr2 130872871 + C T NM-02_TUMOR_DNA POTEF SY G184G . PASS het 8
72 rs199695856_NM-02_TUMOR_DNA chr2 132919171 + A G NM-02_TUMOR_DNA ANKRD30BL SY H36H . PASS het
73 rs111295191_NM-02_TUMOR_DNA chr2 132919192 + G A NM-02_TUMOR_DNA ANKRD30BL SY N29N . PASS het
76 VAR186_NM-02_TUMOR_DNA chr2 167084231 + T A NM-02_TUMOR_DNA SCN9A SY A1392A . PASS het 3 19
77 VAR187_NM-02_TUMOR_DNA chr2 168100115 + C G NM-02_TUMOR_DNA XIRP2 MS T738S . PASS het 9 49
78 VAR188_NM-02_TUMOR_DNA chr2 179343033 + G T NM-02_TUMOR_DNA FKBP7 MS A65D . PASS het 3 7
79 VAR189_NM-02_TUMOR_DNA chr2 179544108 + G C NM-02_TUMOR_DNA TTN MS P11234A . PASS het 3 17
82 VAR192_NM-02_TUMOR_DNA chr2 220074164 + G T NM-02_TUMOR_DNA ZFAND2B MS E92D . PASS het 2 2
83 VAR193_NM-02_TUMOR_DNA chr2 220420892 + C A NM-02_TUMOR_DNA OBSL1 MS G1487W . PASS het 3 9
84 rs191578275_NM-02_TUMOR_DNA chr2 233273263 + C A NM-02_TUMOR_DNA ALPPL2 MS P279Q . PASS het 3
86 VAR196_NM-02_TUMOR_DNA chr2 241815391 + G T NM-02_TUMOR_DNA AGXT SY L272L . PASS het 3 10
88 VAR198_NM-02_TUMOR_DNA chr3 9484995 + C T NM-02_TUMOR_DNA SETD5 SG R361* . PASS het 3 18
96 VAR206_NM-02_TUMOR_DNA chr3 49848502 + G T NM-02_TUMOR_DNA UBA7 MS P382H . PASS het 5 38
102 VAR212_NM-02_TUMOR_DNA chr3 58302669 + G T NM-02_TUMOR_DNA RPP14 MS L89F . PASS het 3 30
103 VAR213_NM-02_TUMOR_DNA chr3 63981750 + C A NM-02_TUMOR_DNA ATXN7 MS T751K . PASS het 3 13
104 rs146577101_NM-02_TUMOR_DNA chr3 97868656 + C T NM-02_TUMOR_DNA OR5H14 MS R143W . PASS het 4
107 rs58176285_NM-02_TUMOR_DNA chr3 123419183 + G A NM-02_TUMOR_DNA MYLK SY A1044A . PASS het 18
108 VAR218_NM-02_TUMOR_DNA chr3 123419189 + C T NM-02_TUMOR_DNA MYLK SY K1042K . PASS het 23 174
115 VAR225_NM-02_TUMOR_DNA chr3 183753779 + C A NM-02_TUMOR_DNA HTR3D MS P91T . PASS het 4 48
File2:
FBN1
HELZ
RALGPS2
DYNC1I2
NFE2L2
POSTN
INO80
I want the rows that contain these genes.
Answer 0 (score: 1)
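So, if I follow you correctly, you just want to use the genes in `file2` to search `$9` of `file1`. Maybe something like the following; with `MYLK` added to the list, I get: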
awk 'NR==FNR{A[$1];next}$9 in A' file2 file1
**empty line** (since `MYLK` was added after a line break, the empty line is included)
107 rs58176285_NM-02_TUMOR_DNA chr3 123419183 + G A NM-02_TUMOR_DNA MYLK SY A1044A . PASS het 18
108 VAR218_NM-02_TUMOR_DNA chr3 123419189 + C T NM-02_TUMOR_DNA MYLK SY K1042K . PASS het 23 174
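One possible reason the command in the question printed every row, assuming `File2` contains a blank line (the empty line in the output above hints at one): the loop stores that blank line as the key `""`, and an empty string used as a dynamic regex matches any field, so `$10 ~ i` is true for every row whichever column is tested. A minimal sketch with two rows copied from the sample (trailing columns trimmed, tabs assumed as the delimiter); the temp-file and variable names are made up for the demo:

```shell
tmp=$(mktemp -d)

# Hypothetical gene list that ends with a blank line.
printf 'FBN1\nHELZ\n\n' > "$tmp/File2"

# Two sample rows; neither gene (TARDBP, MYLK) is in the list above.
printf '3\tVAR113_NM-02_TUMOR_DNA\tchr1\t11082255\t+\tG\tT\tNM-02_TUMOR_DNA\tTARDBP\tMS\tK263N\n' > "$tmp/file1"
printf '107\trs58176285_NM-02_TUMOR_DNA\tchr3\t123419183\t+\tG\tA\tNM-02_TUMOR_DNA\tMYLK\tSY\tA1044A\n' >> "$tmp/file1"

# Regex loop from the question: the key "" matches every row.
regex_out=$(awk 'NR==FNR{a[$0]=1;next} {for(i in a){if($10~i){print;break}}}' "$tmp/File2" "$tmp/file1")

# Exact lookup on column 9: nothing matches, as expected for this list.
exact_out=$(awk 'NR==FNR{A[$1];next}$9 in A' "$tmp/File2" "$tmp/file1")

echo "regex loop output:";   printf '%s\n' "$regex_out"
echo "exact lookup output:"; printf '%s\n' "$exact_out"

rm -rf "$tmp"
```

The `in` test compares whole strings, so besides ignoring the empty pattern it also avoids partial matches that `~` would accept (a short gene name matching inside a longer one).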
To remove the empty line from the output:
awk 'NR==FNR{A[$1];next}$9 in A' file2 file1 | awk '!/^$/'
107 rs58176285_NM-02_TUMOR_DNA chr3 123419183 + G A NM-02_TUMOR_DNA MYLK SY A1044A . PASS het 18
108 VAR218_NM-02_TUMOR_DNA chr3 123419189 + C T NM-02_TUMOR_DNA MYLK SY K1042K . PASS het 23 174
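If the blank line comes from `file2`, another option is to skip empty names while loading the list, so no second `awk` is needed. This is only a sketch, assuming that `file1` really is tab-delimited and that the gene symbol is the 9th column; the sample data is trimmed from the question and the array name `genes` is arbitrary:

```shell
tmp=$(mktemp -d)

# Gene list with a blank line before MYLK, as in the situation described above.
printf 'FBN1\nHELZ\n\nMYLK\n' > "$tmp/file2"

# Two sample rows (trailing columns trimmed) with a blank line between them.
printf '3\tVAR113_NM-02_TUMOR_DNA\tchr1\t11082255\t+\tG\tT\tNM-02_TUMOR_DNA\tTARDBP\tMS\n' > "$tmp/file1"
printf '\n' >> "$tmp/file1"
printf '108\tVAR218_NM-02_TUMOR_DNA\tchr3\t123419189\t+\tC\tT\tNM-02_TUMOR_DNA\tMYLK\tSY\n' >> "$tmp/file1"

filtered=$(awk -F'\t' '
    NR == FNR { if ($1 != "") genes[$1]; next }  # first file: keep non-empty names only
    $9 in genes                                  # second file: print rows whose 9th field is listed
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$filtered"

rm -rf "$tmp"
```

With the empty key never stored, a blank or short row in `file1` (whose `$9` is empty) no longer matches, so only the `MYLK` row is printed here.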