I am using sed to edit a file and have run into a problem that I hope a sed guru can solve.
I have an unstructured/partially structured file that looks like this:
##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=MULTI_ALLELIC,Number=0,Type=Flag,Description="indicates whether a site is multi-allelic">
##source_20160901.1=vcf-subset(r940) -f -c HG02291 /net/isilonP/public/rw/ensembl/1000genomes/release-17/tmp/slicer/1.1-1000000.ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG02291
1 10177 rs367896724 A AC 100 PASS AA=1 GT 1|0
1 10235 rs540431307 T TA 100 PASS XX=5 GT 0|0
1 10352 rs555500075 T TA 100 PASS JJ=7 GT 0|1
I have inserted a line into the file using the following command:
sed 's/.*##source_.*/\#\#INFO=\<ID=P_ID\,Number=1\,Type=String\,Description=\"Person Identifier\"\>\n&/' infile > outfile
The output looks like this:
##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=MULTI_ALLELIC,Number=0,Type=Flag,Description="indicates whether a site is multi-allelic">
##INFO=<ID=P_ID,Number=1,Type=String,Description="Person Identifier">
##source_20160901.1=vcf-subset(r940) -f -c HG02291 /net/isilonP/public/rw/ensembl/1000genomes/release-17/tmp/slicer/1.1-1000000.ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG02291
1 10177 rs367896724 A AC 100 PASS AA=1 GT 1|0
1 10235 rs540431307 T TA 100 PASS XX=5 GT 0|0
1 10352 rs555500075 T TA 100 PASS JJ=7 GT 0|1
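The backslash escapes in the command above are probably not needed. As a minimal sketch, assuming GNU sed (which accepts the one-line form of the `i` command), the same header line could be inserted before the `##source_` line as follows. The helper name `insert_pid_header` is hypothetical, and the sample input is a shortened stand-in for the real file:

```shell
#!/bin/sh
# Sketch: insert the P_ID header line before the line starting with "##source_".
# Assumes GNU sed; insert_pid_header is a hypothetical helper name.
insert_pid_header() {
    sed '/^##source_/i ##INFO=<ID=P_ID,Number=1,Type=String,Description="Person Identifier">'
}

# Example run on a shortened two-line sample read from standard input.
printf '%s\n' '##source_20160901.1=vcf-subset(r940)' '#CHROM POS ID' | insert_pid_header
```

For a real file this would be run as `insert_pid_header < infile > outfile`.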
The next thing I want to do is take the file above as input and append ;P_ID=12345 to column 8, i.e. to AA=1, XX=5 and JJ=7.
The output should look like this:
##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=MULTI_ALLELIC,Number=0,Type=Flag,Description="indicates whether a site is multi-allelic">
##INFO=<ID=P_ID,Number=1,Type=String,Description="Person Identifier">
##source_20160901.1=vcf-subset(r940) -f -c HG02291 /net/isilonP/public/rw/ensembl/1000genomes/release-17/tmp/slicer/1.1-1000000.ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG02291
1 10177 rs367896724 A AC 100 PASS AA=1;P_ID=12345 GT 1|0
1 10235 rs540431307 T TA 100 PASS XX=5;P_ID=12345 GT 0|0
1 10352 rs555500075 T TA 100 PASS JJ=7;P_ID=12345 GT 0|1
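So far I have managed to select column 8, but I am not sure how to put the updated line back into the file after appending the extra information.
This is how I select column 8: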
sed -re '{s/^(\S+\s+){7}(\S+).*$/\2/;p}'
Can anyone help me solve this puzzle?
Thanks in advance!
PRASHANT
Answer 0: (score: 2)
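sed -re '/^#/!s/^((\S+\s+){7}\S+)/\1;P_ID=12345/' /tmp/so5.txt
where /tmp/so5.txt is your input file. The /^#/! address limits the substitution to lines that do not start with #; without it, header lines with eight or more space-separated words (such as the ##INFO descriptions and the #CHROM line) would also be modified. \s+ is used between fields so that tab-delimited input matches as well.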
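To check this kind of substitution on a small sample, the sketch below wraps it in a hypothetical helper, `append_pid`, and restricts it with a `/^#/!` address so that header lines are left alone. It assumes GNU sed (for `-r`, `\S` and `\s`) and an identifier that contains no `/`, `&` or backslash, since the value is pasted directly into the sed script:

```shell
#!/bin/sh
# Sketch: append ";P_ID=<id>" to the 8th whitespace-separated field of data lines.
# Assumes GNU sed (-r, \S, \s); append_pid is a hypothetical helper name.
append_pid() {
    # /^#/! applies the substitution only to lines that do not start with "#",
    # so header lines containing eight or more words stay untouched.
    sed -r '/^#/!s/^((\S+\s+){7}\S+)/\1;P_ID='"$1"'/'
}

# Example run: two header lines and one data line.
printf '%s\n' \
    '##INFO=<ID=X,Description="one two three four five six seven eight nine">' \
    '#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG02291' \
    '1 10177 rs367896724 A AC 100 PASS AA=1 GT 1|0' | append_pid 12345
```

Only the last line of the sample should change, gaining `;P_ID=12345` after `AA=1`.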
Answer 1: (score: 1)
sed is for simple substitutions on individual lines, that is all. For anything else you should use awk:
$ awk '
/^##source_/ { print "##INFO=<ID=P_ID,Number=1,Type=String,Description=\"Person Identifier\">" }
!/^#/ { $8 = $8 ";P_ID=12345" }
{ print }
' file
##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=MULTI_ALLELIC,Number=0,Type=Flag,Description="indicates whether a site is multi-allelic">
##INFO=<ID=P_ID,Number=1,Type=String,Description="Person Identifier">
##source_20160901.1=vcf-subset(r940) -f -c HG02291 /net/isilonP/public/rw/ensembl/1000genomes/release-17/tmp/slicer/1.1-1000000.ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG02291
1 10177 rs367896724 A AC 100 PASS AA=1;P_ID=12345 GT 1|0
1 10235 rs540431307 T TA 100 PASS XX=5;P_ID=12345 GT 0|0
1 10352 rs555500075 T TA 100 PASS JJ=7;P_ID=12345 GT 0|1
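One caveat: assigning to `$8` makes awk rebuild the record using `OFS`, which defaults to a single space, so a tab-delimited file would come out space-delimited. The following is a sketch of a tab-preserving variant, assuming the real input is tab-separated (as VCF files normally are). The helper name `annotate_vcf` and the awk variable `pid` are hypothetical:

```shell
#!/bin/sh
# Sketch: insert the P_ID header and append ";P_ID=<id>" to column 8,
# keeping tab delimiters. annotate_vcf and pid are hypothetical names.
annotate_vcf() {
    awk -v pid="$1" '
        BEGIN { FS = OFS = "\t" }   # keep tabs when awk rebuilds a modified record
        /^##source_/ { print "##INFO=<ID=P_ID,Number=1,Type=String,Description=\"Person Identifier\">" }
        !/^#/ { $8 = $8 ";P_ID=" pid }
        { print }
    '
}

# Example run on a shortened, tab-delimited sample.
printf '##source_x\n#CHROM\tPOS\n1\t10177\trs367896724\tA\tAC\t100\tPASS\tAA=1\tGT\t1|0\n' | annotate_vcf 12345
```

Passing the identifier with `-v` keeps it out of the awk program text, so the same function can be reused for other IDs, for example `annotate_vcf 67890 < file`.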