我有一个目前看起来像这样的文件,例如:
>ENSOFAS001369_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
>ENSOFAS001369_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
>ENSOFAS001264_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
>ENSOFAS001264_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
>ENSOFAS001369_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Alydus_pilosus_contig594
>ENSOFAS004569_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS004569,probes-probe:,probes-source:Alydus_pilosus_contig30341
关注>
和|
符号之间的文字部分,我想根据匹配的ENSOFAS
数字ID添加顺序编号。也就是说,我想采取这一点,并将其做到这一点:
>ENSOFAS001369_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
>ENSOFAS001369_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
>ENSOFAS001264_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
>ENSOFAS001264_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
>ENSOFAS001369_p3 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Alydus_pilosus_contig594
>ENSOFAS004569_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS004569,probes-probe:,probes-source:Alydus_pilosus_contig30341
我可以在textwrangler(> ENSOFAS(\ d +)_ p(。+)\ r)中进行搜索,但我知道文本编辑器无法在{{1}之后添加数字方面做我需要的工作}。我认为搜索部分的macOS linux版本可能是_p
,但不知道如何在grep -E ">ENSOFAS[0-9]\{6\}_p\s|"
和_p
之前的空格之间进行编号。匹配ENSOFAS数字不会在文本文件中聚集在一起,但如果需要,我可以采用某种排序方式。
答案 0 :(得分:0)
如果您的设置中有awk
选项:
$ awk '{cnt[$1]++; $1=$1""cnt[$1]; print}' file
>ENSOFAS001369_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
>ENSOFAS001369_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
>ENSOFAS001264_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
>ENSOFAS001264_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
>ENSOFAS001369_p3 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Alydus_pilosus_contig594
>ENSOFAS004569_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS004569,probes-probe:,probes-source:Alydus_pilosus_contig30341
说明:$1
将包含第一个字段(对于每一行),例如>ENSOFAS001369_p
。我们使用关联数组cnt
来计算来自$1
的每个唯一标记的出现次数,并修改字段$1
(先前输出)以包括处理的记录/行的当前计数。 / p>
awk
脚本可以缩短,但在这种形式下可能更易读和理解。
答案 1 :(得分:0)
短 awk 方法:
awk '{ $1=$1""++a[$1] }1' file
输出:
>ENSOFAS001369_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
>ENSOFAS001369_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
>ENSOFAS001264_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
>ENSOFAS001264_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
>ENSOFAS001369_p3 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Alydus_pilosus_contig594
>ENSOFAS004569_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS004569,probes-probe:,probes-source:Alydus_pilosus_contig30341
使用 awk sub()
函数的替代方法:
awk '{ sub(/$/,++a[$1],$1) }1' file