Question

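My unfamiliarity with AWK is letting me down: I can't figure out how to match a variable at the end of a line.

This would be fairly trivial with grep and the like, but I'm interested in matching integers at the end of a string in a specific field of a TSV, and all the posts suggest (and I believe it to be the case!) that awk is the way to go.

If I want to match just one value explicitly, that's easy.

Here's my example file: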

PVClopT_11  PAU_02102   PAU_02064   1pqx    1pqx_A  37.4    13  0.00035 31.4    >1pqx_A Conserved hypothetical protein; ZR18,structure, autostructure,spins,autoassign, northeast structural genomics consortium; NMR {Staphylococcus aureus subsp} SCOP: d.267.1.1 PDB: 2ffm_A 2m6q_A 2m8w_A   No DOI found.
PVCpnf_18   PAK_3526    PAK_03186   3fxq    3fxq_A  99.7    2.7e-21 7e-26   122.2   >3fxq_A LYSR type regulator of TSAMBCD; transcriptional regulator, LTTR, TSAR, WHTH, DNA- transcription, transcription regulation; 1.85A {Comamonas testosteroni} PDB: 3fxr_A* 3fxu_A* 3fzj_A 3n6t_A 3n6u_A*    10.1111/j.1365-2958.2010.07043.x
PVCunit1_19 PAU_02807   PAU_02793   3kx6    3kx6_A  19.7    45  0.0012  31.3    >3kx6_A Fructose-bisphosphate aldolase; ssgcid, NIH, niaid, SBRI, UW, emerald biostructures, glycolysis, lyase, STRU genomics; HET: CIT; 2.10A {Babesia bovis}  No DOI found.
PVClumt_17  PAU_02231   PAU_02190   3lfh    3lfh_A  39.7    12  0.0003  28.9    >3lfh_A Manxa, phosphotransferase system, mannose/fructose-speci component IIA; PTS; 1.80A {Thermoanaerobacter tengcongensis}   No DOI found.
PVCcif_11   plu2521 PLT_02558   3h2t    3h2t_A  96.6    2.6e-05 6.7e-10 79.0    >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCpnf_16   PAU_03338   PAU_03377   5jbr    5jbr_A  29.2    22  0.00058 23.9    >5jbr_A Uncharacterized protein BCAV_2135; structural genomics, PSI-biology, midwest center for structu genomics, MCSG, unknown function; 1.65A {Beutenbergia cavernae} No DOI found.
PVCunit1_17 PAK_2892    PAK_02622   1cii    1cii_A  63.2    2.7 6.9e-05 41.7    >1cii_A Colicin IA; bacteriocin, ION channel formation, transmembrane protein; 3.00A {Escherichia coli} SCOP: f.1.1.1 h.4.3.1   10.1038/385461a0
PVCunit1_11 PAK_2886    PAK_02616   3h2t    3h2t_A  96.6    1.9e-05 4.9e-10 79.9    >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCpnf_11   PAU_03343   PAU_03382   3h2t    3h2t_A  97.4    4.4e-07 1.2e-11 89.7    >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCunit1_5  afp5    PAU_02779   4tv4    4tv4_A  63.6    2.6 6.7e-05 30.5    >4tv4_A Uncharacterized protein; unknown function, ssgcid, virulence, structural genomics; 2.10A {Burkholderia pseudomallei}    No DOI found.

I can pull out all the lines that have "_11" at the end of the first column by running the following on the command line:

awk '{ if ($1 ~ /_11$/) { print } }' 02052017_HHresults_sorted.tsv

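I want to put this in a loop to cover all the integers from 1 to 5 (for instance), but I'm having trouble passing a variable into the text match.

I expect it should be something like the following, but the $i$ is probably not correct, and my google-fu has failed me: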

awk 'BEGIN{ for (i=1;i<=5;i++){ if ($1 ~ /_$i$/) { print } } }' 02052017_HHresults_sorted.tsv

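There may be other problems with that awk command that I haven't spotted; as I say, I'm not very awk-savvy.

EDIT TO CLARIFY

I want to keep the matches separate, so I can't use a character class. That is, I want all lines ending in "_1" in one file, then all lines ending in "_2" in another, and so on (hence the loop).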

Answer 1

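You can't put variables inside //. Inside the slashes, $i is literal regex text (and elsewhere in awk, $i means field number i, not the value of i). Use string concatenation instead, which in awk is done by simply putting strings next to each other. You don't need a regex literal when you're using the ~ operator; it always treats its second operand as a regex. Also drop the BEGIN: that block runs before any input is read, so $1 is empty there.

awk '{ for (i = 1; i <= 5; i++) { if ($1 ~ ("_" i "$")) { print; break } } }' 02052017_HHresults_sorted.tsv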
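To tie this back to the edit in the question (one output file per suffix), here is a minimal sketch. The `sample.tsv` contents and the `suffix_N.tsv` output names are made up for illustration. It passes the shell loop variable into awk with `-v` and builds the regex by concatenation. It rereads the input once per suffix, which is fine for small files.

```shell
# Hypothetical sample data: three tab-separated rows (names invented for illustration).
workdir=$(mktemp -d)
printf 'PVCa_1\tx\nPVCb_11\ty\nPVCc_2\tz\n' > "$workdir/sample.tsv"

# -v copies the shell value into the awk variable n; "_" n "$" is plain
# string concatenation, and ~ interprets the resulting string as a regex.
for n in 1 2; do
    awk -F'\t' -v n="$n" '$1 ~ ("_" n "$")' "$workdir/sample.tsv" > "$workdir/suffix_$n.tsv"
done
```

Because the pattern is anchored with `$` and starts with the underscore, the `_11` row is not picked up when `n` is 1.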

Answer 2

It sounds like you're thinking about this the wrong way, and all you really need is (using GNU awk for gensub()):

awk '{ print > ("out" gensub(/.*_/,"",1,$1)) }' 02052017_HHresults_sorted.tsv

or with any awk:

awk '{ n=$1; sub(/.*_/,"",n); print > ("out" n) }' 02052017_HHresults_sorted.tsv
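The commands above write every suffix to its own file. If only suffixes 1 to 5 are wanted, as in the question, one possible variation is to filter on the extracted suffix first. This is a sketch on made-up data; `sample.tsv` and its rows are assumptions, and the `out` prefix is taken from the answer.

```shell
# Hypothetical sample data (names invented for illustration).
workdir=$(mktemp -d)
printf 'PVCa_1\tx\nPVCb_11\ty\nPVCc_2\tz\nPVCd_1\tw\n' > "$workdir/sample.tsv"

# First rule: strip everything up to the last underscore of field 1,
# leaving the suffix in n. Second rule: only when n is a single digit
# from 1 to 5, write the row to the file named "out" followed by n.
( cd "$workdir" && awk '{ n = $1; sub(/.*_/, "", n) }
    n ~ /^[1-5]$/ { print > ("out" n) }' sample.tsv )
```

Here both `_1` rows end up in `out1`, the `_2` row in `out2`, and the `_11` row is skipped.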

Answer 3

No need for a loop; use a regex character class [..] (the three-argument match() used here requires GNU awk):

awk 'match($1,/_([1-5])$/,a){ print >> a[1]".txt" }' 02052017_HHresults_sorted.tsv
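For awks without the three-argument match(), a similar effect can be sketched with the two-argument form, which sets RSTART to the position where the match begins. The sample rows below are made up for illustration; the `.txt` naming follows the answer.

```shell
# Hypothetical sample data (names invented for illustration).
workdir=$(mktemp -d)
printf 'PVCa_1\tx\nPVCb_11\ty\nPVCc_5\tz\n' > "$workdir/sample.tsv"

# Two-argument match() sets RSTART to where "_<digit>" begins, so
# substr($1, RSTART + 1) is the single digit after the underscore.
( cd "$workdir" && awk 'match($1, /_[1-5]$/) { print >> (substr($1, RSTART + 1) ".txt") }' sample.tsv )
```

The output file name is parenthesized because an unparenthesized concatenation after a redirection is not handled the same way by every awk.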
