Editing an unstructured file with sed

Asked: 2016-09-03 17:53:23

Tags: bash sed

I'm editing a file with sed and have run into a problem that I hope a sed guru can solve.

I have an unstructured / partially structured file that looks like this:

##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=MULTI_ALLELIC,Number=0,Type=Flag,Description="indicates whether a site is multi-allelic">
##source_20160901.1=vcf-subset(r940) -f -c HG02291 /net/isilonP/public/rw/ensembl/1000genomes/release-17/tmp/slicer/1.1-1000000.ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG02291
1   10177   rs367896724 A   AC  100 PASS    AA=1    GT  1|0
1   10235   rs540431307 T   TA  100 PASS    XX=5    GT  0|0
1   10352   rs555500075 T   TA  100 PASS    JJ=7    GT  0|1

I inserted a line into the file with the following command:

sed 's/.*##source_.*/##INFO=<ID=P_ID,Number=1,Type=String,Description="Person Identifier">\n&/' infile > outfile

The output looks like this:

##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=MULTI_ALLELIC,Number=0,Type=Flag,Description="indicates whether a site is multi-allelic">
##INFO=<ID=P_ID,Number=1,Type=String,Description="Person Identifier">
##source_20160901.1=vcf-subset(r940) -f -c HG02291 /net/isilonP/public/rw/ensembl/1000genomes/release-17/tmp/slicer/1.1-1000000.ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG02291
1   10177   rs367896724 A   AC  100 PASS    AA=1    GT  1|0
1   10235   rs540431307 T   TA  100 PASS    XX=5    GT  0|0
1   10352   rs555500075 T   TA  100 PASS    JJ=7    GT  0|1

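The next thing I want to do is take the file above as input and append ;P_ID=12345 to the 8th column, i.e. to AA=1, XX=5 and JJ=7.

The output should look like this: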

##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=MULTI_ALLELIC,Number=0,Type=Flag,Description="indicates whether a site is multi-allelic">
##INFO=<ID=P_ID,Number=1,Type=String,Description="Person Identifier">
##source_20160901.1=vcf-subset(r940) -f -c HG02291 /net/isilonP/public/rw/ensembl/1000genomes/release-17/tmp/slicer/1.1-1000000.ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG02291
1   10177   rs367896724 A   AC  100 PASS    AA=1;P_ID=12345 GT  1|0
1   10235   rs540431307 T   TA  100 PASS    XX=5;P_ID=12345 GT  0|0
1   10352   rs555500075 T   TA  100 PASS    JJ=7;P_ID=12345 GT  0|1

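So far I have managed to select the 8th column, but I'm not sure how to put the updated line back into the file after appending the information.

This is how I select the 8th column: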

sed -re '{s/^(\S+\s+){7}(\S+).*$/\2/;p}'

Can anyone help me solve this puzzle?

Thanks in advance!

PRASHANT

2 Answers:

Answer 0 (score: 2)

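sed -re '/^#/!s/^((\S+\s+){7}\S+)/\1;P_ID=12345/' /tmp/so5.txt

where /tmp/so5.txt is your input file. The /^#/! address restricts the substitution to lines that do not start with #, so header lines that happen to contain eight or more whitespace-separated words (such as the EX_TARGET description and the #CHROM line) are left untouched. Using \s+ rather than a literal space lets the fields be separated by tabs as well as spaces.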
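
For illustration, both steps from the question (inserting the ##INFO line and appending to the 8th column) could be combined in a single sed call. This is only a sketch: it assumes GNU sed (-r, \S, \s and \n in the replacement are GNU extensions), the sample file is a hypothetical, shortened stand-in for the real input, and the variable names sample and result are made up for the demo.

```shell
# Hypothetical, shortened stand-in for the file in the question.
# Data lines are tab-separated, as is usual for VCF.
sample=$(mktemp)
{
  printf '%s\n' '##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">'
  printf '%s\n' '##source_20160901.1=vcf-subset(r940) -f -c HG02291'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG02291\n'
  printf '1\t10177\trs367896724\tA\tAC\t100\tPASS\tAA=1\tGT\t1|0\n'
} > "$sample"

# Expression 1: put the new ##INFO line in front of the ##source_ line.
# Expression 2: on lines NOT starting with '#', append to the 8th field.
result=$(sed -r \
  -e 's/^##source_.*/##INFO=<ID=P_ID,Number=1,Type=String,Description="Person Identifier">\n&/' \
  -e '/^#/!s/^((\S+\s+){7}\S+)/\1;P_ID=12345/' \
  "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Because the second expression is guarded by /^#/!, the freshly inserted ##INFO line and the other header lines are not touched, even though some of them contain more than eight words.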

Answer 1 (score: 1)

sed is for simple substitutions on individual lines, that is all. For anything else you should use awk:

$ awk '
/^##source_/ { print "##INFO=<ID=P_ID,Number=1,Type=String,Description=\"Person Identifier\">" }
!/^#/ { $8 = $8 ";P_ID=12345" }
{ print }
' file
##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=MULTI_ALLELIC,Number=0,Type=Flag,Description="indicates whether a site is multi-allelic">
##INFO=<ID=P_ID,Number=1,Type=String,Description="Person Identifier">
##source_20160901.1=vcf-subset(r940) -f -c HG02291 /net/isilonP/public/rw/ensembl/1000genomes/release-17/tmp/slicer/1.1-1000000.ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG02291
1 10177 rs367896724 A AC 100 PASS AA=1;P_ID=12345 GT 1|0
1 10235 rs540431307 T TA 100 PASS XX=5;P_ID=12345 GT 0|0
1 10352 rs555500075 T TA 100 PASS JJ=7;P_ID=12345 GT 0|1
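
Note that in the output above the data lines come back separated by single spaces, because assigning to $8 makes awk rebuild the record with its default output separator. If the real file is tab-delimited (an assumption here, though usual for VCF), a variant along these lines may preserve the tabs. The sample file and the names sample, result and pid are hypothetical, and the identifier is passed in with -v rather than hard-coded.

```shell
# Hypothetical, shortened stand-in for the file in the question (tab-separated).
sample=$(mktemp)
{
  printf '%s\n' '##source_20160901.1=vcf-subset(r940) -f -c HG02291'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG02291\n'
  printf '1\t10177\trs367896724\tA\tAC\t100\tPASS\tAA=1\tGT\t1|0\n'
} > "$sample"

result=$(awk -v pid=12345 '
BEGIN { FS = OFS = "\t" }        # split and rejoin fields on tabs
/^##source_/ { print "##INFO=<ID=P_ID,Number=1,Type=String,Description=\"Person Identifier\">" }
!/^#/ { $8 = $8 ";P_ID=" pid }   # data lines only: append to the INFO column
{ print }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Header lines are never assigned to, so awk prints them exactly as read; only the data lines are rebuilt, and with OFS set to a tab they keep their original delimiter.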