Question

I have a text file in this format:

01  contig00041 1   878 +   YP_003990830.1  metalloendopeptidase, glycoprotease family  Geobacillus sp. Y4.1MC1 100.00  291 1   291 47  337 0.0 592 #line 1
01  contig00041 1241    3117    -   YP_002948419.1  ABC transporter Geobacillus sp. WCH70   84.94 #line 2
37.31   624 #line 3
260 1 #line 4
321 624 #line 5
532 23 #line 6
12  644 #line 7
270 0.0 #line 8
3e-37   1046 #line 9
154 #line 10

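I need to detect the lines that contain 8 columns (line 2) and transpose the second column of the following 7 lines (lines 3-9) onto the end of the 8-column line. Finally, line 10 should be dropped. This pattern repeats throughout a large text file, but infrequently (30 times in a 2000-line file). Is it possible to do this with awk?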

The edited text file must look like this:

01  contig00041 1   878 +   YP_003990830.1  metalloendopeptidase, glycoprotease family  Geobacillus sp. Y4.1MC1 100.00  291 1   291 47  337 0.0 592 #line 1
01  contig00041 1241    3117    -   YP_002948419.1  ABC transporter Geobacillus sp. WCH70   84.94   624 1   624 23  644 0.0 1046 #line 2

Thank you very much.

Answer 1

awk 'NF == 12 { t = $0; for (i = 1; i <= 7; ++i) { r = getline; if (r < 1) break; t = t "\t" $2; } print t; next; } NF > 12' temp.txt

Output:

01  contig00041 1   878 +   YP_003990830.1  metalloendopeptidase, glycoprotease family  Geobacillus sp. Y4.1MC1 100.00  291 1   291 47  337 0.0 592
01  contig00041 1241    3117    -   YP_002948419.1  ABC transporter Geobacillus sp. WCH70   84.94       624     1       624     23      644     0.0 1046

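Lines with more than 12 fields are printed unchanged.
When a line with exactly 12 fields is found, the second column of each of the next 7 lines is appended to it, separated by tabs, and the result is printed. With awk's default whitespace splitting, the row the question calls "8 columns" has 12 fields, because its description columns contain spaces.
All other lines are ignored, which is what drops line 10.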
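
The same logic can be laid out over several lines with comments, as a sketch. The wrapper name `join_short_rows` is hypothetical and not part of the answer; the awk program assumes default whitespace field splitting.

```shell
# join_short_rows: hypothetical wrapper around the awk program above.
# Reads the named files, or standard input when no file is given.
join_short_rows() {
    awk '
    NF == 12 {                       # truncated record
        row = $0
        for (i = 1; i <= 7; ++i) {   # consume the next 7 continuation lines
            r = getline              # loads the next line into $0 and NF
            if (r < 1) break         # end of input or read error
            row = row "\t" $2        # keep only the second column
        }
        print row
        next
    }
    NF > 12                          # complete records pass through unchanged
    ' "$@"
}
```

It would be called as `join_short_rows temp.txt > fixed.txt`, where `fixed.txt` is a placeholder output name. Because `getline` advances the input, the continuation lines never reach the other rules, and the trailing one-field line matches no rule and is dropped.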

Answer 2

Edited to append only the second column of two-column lines.

I think this does what you need:

awk 'NF >= 8 { a[++i] = $0 } NF == 2 { a[i] = a[i] " " $2 } END { for (j = 1; j <= i; ++j) print a[j] }' file

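For lines with 8 or more columns, add a new element to the array a. If the line has 2 columns, append its second column to the current array element. Once the whole file has been processed, loop over the array and print every line.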

Output:

01  contig00041 1   878 +   YP_003990830.1  metalloendopeptidase, glycoprotease family  Geobacillus sp. Y4.1MC1 100.00  291 1   291 47  337 0.0 592
01  contig00041 1241    3117    -   YP_002948419.1  ABC transporter Geobacillus sp. WCH70   84.94 624 1 624 23 644 0.0 1046
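
This answer keeps every record in an array until END. For larger inputs, a streaming variant is possible; the following is a sketch of that alternative, not the answer's own code, and the names `append_pairs`, `row` and `have` are hypothetical.

```shell
# append_pairs: hypothetical streaming variant of the array-based script.
# Holds only the current record in memory and prints it once the next
# record (or the end of input) is reached.
append_pairs() {
    awk '
    NF >= 8 {                  # record line: flush the previous one, start anew
        if (have) print row
        row = $0
        have = 1
        next
    }
    NF == 2 && have {          # continuation line: append its second column
        row = row " " $2
    }
    END { if (have) print row }
    ' "$@"
}
```

It would be called as `append_pairs file`. Like the array version, it skips lines that have neither 2 fields nor at least 8, and it appends the second column of any two-field line to the most recent record.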

Transpose columns into a single line after detecting a pattern
