I have several files containing lines written in a specific format, for example:
>m.144 g.144 ORF g.144 m.144 type:internal len:123 (+) Pf1004_1/1_1.000_369:1-372(+)
I want to use sed with a regular expression to strip some characters so that the line ends up in this format:
>Pf1004_1/1_1.000_369
But it doesn't work :/. I used the following script:
#!/bin/bash
for file in *.fasta # Set of fasta files in the script directory
do
sed -i "s/.+?\(\+\) />/g" $file
sed -i "s/:.+//g" $file
done
What is wrong? Here is an excerpt from one of my files:
>m.187 g.187 ORF g.187 m.187 type:internal len:115 (+) Ph1000_1/1_1.000_345:1-348(+)
LIILLTSVSVVVLLVENHLSPSHSVLDLSSEPPTGNATYHCWEVAETVIVIKECSPCSVF
EQKTNPACKETGYSQKVLCMLKDGTESKLPRSCPKITWVEEKQFWLFEVLMALLG
>m.188 g.188 ORF g.188 m.188 type:internal len:100 (+) Ph1002_1/1_1.000_302:1-303(+)
KTDTPRRQRSMSPVANVSCSPSVSSPNLLMKLLDSSDESESDTPHPNRVKVLKPDDMGIK
DFFKNTAAKQGLEERVDVSIQDFDHIINEASDRLPCTKKI
>m.189 g.189 ORF g.189 m.189 type:internal len:125 (+) Ph1007_1/1_1.000_376:1-378(+)
QSATPLHRAAEANRKQAVAELLHAGCDVNRQNEVSITPIFYPAQRGDDVTTRLLIQNGAD
PNVTDAEDWIPLHFASQNGHVATVDALTSARSMVNAAGSHGETPLLIAAEQGHDKVVKHL
LANGA
>m.190 g.190 ORF g.190 m.190 type:internal len:129 (+) Ph1010_1/1_1.000_387:1-390(+)
HVADTGTSSSPQLSPTHAERRPLKVEFIGMKDMASGDTSGRDKRPGVENDLKRINRKATN
CARYQQPRMSLLGKPLNYRAHKRDVRYRRAQAKVYNFLERPKDWRAISYHLLVYVELRDS
TLTVFHPSM
>m.191 g.191 ORF g.191 m.191 type:internal len:185 (+) Ph1014_1/1_1.000_555:1-558(+)
CLADLVTASDNMENDLSDNSNLDQSGTMYAFAAKRKSYGQVKDADHVDSGGDNPERQERP
MSPMCLKIRKSDNGLSPEARRPVTSPSPISPAAPVSDHVDADRDVIERAKELQKAELDKV
VASSFPVPQSGFRSVHSVDISPLHRISVPWPHPVHQPIFPHPHPVALQMSLSNSFRAQNP
DACIR
>m.192 g.192 ORF g.192 m.192 type:internal len:183 (+) Ph1025_1/1_1.000_551:1-552(+)
TQKDWRELLWTYCCCCSKRHVHAEDVDKSAVTSLSEVKAEKQLKSPAKIKTIRNHADVKS
ALSTSCLRRKKNFEEQTICKNELNVKHSDDDNRDMDKQDTKTAITLTPKCFVHFPKSVNH
LQLDQTPLYWGAVSKEAASLCSLPVRNGCTVAAVKDVQDPHLLEIGQVYQNDEEWTPKEL
TAD
>m.19 g.19 ORF g.19 m.19 type:internal len:348 (+) Ph103_1/1_1.000_1044:1-1047(+)
GGHLPSFNDRPGNTMAGSKDDKTNLSPVKLELISPCGPVLSNHVGCIVNNVLYIHGGINK
YLSKEPLNAFYKLNLNAPSPIWQEILDRNSPHLSHHACVVLDNRYLVLIGGWNGKQRTAD
MWAYDVQEAVWISLRTSGFPEGAGLSSHAALPLADGSILVIGREGSARIQRRYGNSWLIR
GSVMRGHFVYNEHQMSLASRSGHTMHVIGSDLTIIGGRSDRQVEQHGGYRTAMTSSAVAF
FSGLNQFVKRTPPMAKPPCGRKQHVSASGSGLILIHGGETFDGKSRHPVGDFYIISLRPT
VTWYHLGTSGVGRAGHVCCTAADKIIIHGGMGPRNAIYGDTYEISLSK
>m.193 g.193 ORF g.193 m.193 type:internal len:130 (+) Ph1046_1/1_1.000_390:1-393(+)
LFRLASESYHSSKMVQRLTLRRRLSYNTSSNRRRIVKTPGGRLVYHYTKKPGAIPICKSG
GCRTKLHGIRPSRPMQRRRMSKRLKTVNRTYGGVQCHTCVREKIIRAFLIEEQKIVVKVL
KAQAAQAKKA
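Answer 0 (score: 0)
Replace the two sed expressions with the following single one (add -i if you want the files edited in place, as in your script):
sed -E 's/^>.+\(\+\) ([^:]+):.+$/>\1/' "$file"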
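For completeness, here is a minimal sketch of how the whole loop could look with that expression. It assumes GNU sed (where -i takes no suffix argument); the function name strip_header is only an illustrative helper, not something from the question.

```shell
#!/bin/bash
# Sketch only: assumes GNU sed (-E for extended regex, -i with no suffix argument).

# Hypothetical helper: rewrite the header lines of one FASTA file in place.
strip_header() {
    # Keep only the ID that follows "(+) " and precedes the next colon.
    # Sequence lines do not start with ">" and are left untouched.
    sed -E -i 's/^>.+\(\+\) ([^:]+):.+$/>\1/' "$1"
}

for file in *.fasta; do
    [ -e "$file" ] || continue   # skip the literal pattern if no file matched
    strip_header "$file"
done
```

Quoting "$file" keeps the loop working for file names that contain spaces.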
Answer 1 (score: 0)
Why not do it like this:
sed -e 's/^.*[ ]([+])[ ]/>/g' -e 's/[:].*$//' $file
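The first expression:
's/^.*[ ]([+])[ ]/>/g'
replaces everything from the start of the line through the last " (+) " (a space, then (+), then a space) with >. Without -E, sed uses basic regular expressions, in which unescaped parentheses are literal characters.
The second expression:
's/[:].*$//'
simply cuts off everything from the first remaining : to the end of the line.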
Example:
$ echo ">m.144 g.144 ORF g.144 m.144 type:internal len:123 (+) Pf1004_1/1_1.000_369:1-372(+)" | \
sed -e 's/^.*[ ]([+])[ ]/>/g' -e 's/[:].*$//'
>Pf1004_1/1_1.000_369
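Answer 2 (score: 0)
I think the problem may be that sed's regular expression syntax is not what you expect. See the explanation here, especially what "+" means: https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html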
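To make the difference concrete, here is a small sketch that runs the second substitution from the question on the sample header, once with sed's default basic syntax and once with -E. The variable names are only for illustration.

```shell
line='>m.144 g.144 ORF g.144 m.144 type:internal len:123 (+) Pf1004_1/1_1.000_369:1-372(+)'

# Basic syntax (the default): "+" is a literal plus sign, so ":.+" only matches
# a colon, any one character, and then a literal "+". Nothing in the line fits.
bre=$(printf '%s\n' "$line" | sed 's/:.+//')

# Extended syntax (-E): "+" means "one or more", so ":.+" matches from the
# first colon on the line to the end.
ere=$(printf '%s\n' "$line" | sed -E 's/:.+//')

printf 'basic:    %s\n' "$bre"
printf 'extended: %s\n' "$ere"
```

With the basic syntax the line comes back unchanged, which is why the original script appears to do nothing. Note also that sed has no lazy quantifiers, so ".+?" does not mean "as few characters as possible" in either mode.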
Answer 3 (score: 0)
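Depending on how regular your data is, this simple awk script may be enough. It splits each line on spaces and colons and, for header lines, prints the second-to-last field, which is where the ID sits in your sample:
awk -F '[ :]' '/^>/ { print ">" $(NF-1); next } 1' infile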
Output:
>Ph1000_1/1_1.000_345
LIILLTSVSVVVLLVENHLSPSHSVLDLSSEPPTGNATYHCWEVAETVIVIKECSPCSVF
EQKTNPACKETGYSQKVLCMLKDGTESKLPRSCPKITWVEEKQFWLFEVLMALLG
>Ph1002_1/1_1.000_302
KTDTPRRQRSMSPVANVSCSPSVSSPNLLMKLLDSSDESESDTPHPNRVKVLKPDDMGIK
DFFKNTAAKQGLEERVDVSIQDFDHIINEASDRLPCTKKI
>Ph1007_1/1_1.000_376
QSATPLHRAAEANRKQAVAELLHAGCDVNRQNEVSITPIFYPAQRGDDVTTRLLIQNGAD
PNVTDAEDWIPLHFASQNGHVATVDALTSARSMVNAAGSHGETPLLIAAEQGHDKVVKHL
LANGA
>Ph1010_1/1_1.000_387
HVADTGTSSSPQLSPTHAERRPLKVEFIGMKDMASGDTSGRDKRPGVENDLKRINRKATN
CARYQQPRMSLLGKPLNYRAHKRDVRYRRAQAKVYNFLERPKDWRAISYHLLVYVELRDS
TLTVFHPSM
>Ph1014_1/1_1.000_555
CLADLVTASDNMENDLSDNSNLDQSGTMYAFAAKRKSYGQVKDADHVDSGGDNPERQERP
MSPMCLKIRKSDNGLSPEARRPVTSPSPISPAAPVSDHVDADRDVIERAKELQKAELDKV
VASSFPVPQSGFRSVHSVDISPLHRISVPWPHPVHQPIFPHPHPVALQMSLSNSFRAQNP
DACIR
>Ph1025_1/1_1.000_551
TQKDWRELLWTYCCCCSKRHVHAEDVDKSAVTSLSEVKAEKQLKSPAKIKTIRNHADVKS
ALSTSCLRRKKNFEEQTICKNELNVKHSDDDNRDMDKQDTKTAITLTPKCFVHFPKSVNH
LQLDQTPLYWGAVSKEAASLCSLPVRNGCTVAAVKDVQDPHLLEIGQVYQNDEEWTPKEL
TAD
>Ph103_1/1_1.000_1044
GGHLPSFNDRPGNTMAGSKDDKTNLSPVKLELISPCGPVLSNHVGCIVNNVLYIHGGINK
YLSKEPLNAFYKLNLNAPSPIWQEILDRNSPHLSHHACVVLDNRYLVLIGGWNGKQRTAD
MWAYDVQEAVWISLRTSGFPEGAGLSSHAALPLADGSILVIGREGSARIQRRYGNSWLIR
GSVMRGHFVYNEHQMSLASRSGHTMHVIGSDLTIIGGRSDRQVEQHGGYRTAMTSSAVAF
FSGLNQFVKRTPPMAKPPCGRKQHVSASGSGLILIHGGETFDGKSRHPVGDFYIISLRPT
VTWYHLGTSGVGRAGHVCCTAADKIIIHGGMGPRNAIYGDTYEISLSK
>Ph1046_1/1_1.000_390
LFRLASESYHSSKMVQRLTLRRRLSYNTSSNRRRIVKTPGGRLVYHYTKKPGAIPICKSG
GCRTKLHGIRPSRPMQRRRMSKRLKTVNRTYGGVQCHTCVREKIIRAFLIEEQKIVVKVL
KAQAAQAKKA
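awk has no portable in-place option, so to apply this to every file you could go through a temporary file. The following is a rough sketch under that assumption; clean_headers is a hypothetical helper name, and it takes the ID to be the second-to-last field when splitting on spaces and colons, as in the sample above.

```shell
# Hypothetical helper: rewrite the headers of one FASTA file via a temp file.
clean_headers() {
    tmp=$(mktemp) || return 1
    # Header lines: print ">" plus the second-to-last field.
    # All other lines pass through unchanged (the trailing "1").
    if awk -F '[ :]' '/^>/ { print ">" $(NF-1); next } 1' "$1" > "$tmp"; then
        cat "$tmp" > "$1"   # overwrite contents, keeping the file's permissions
    fi
    rm -f "$tmp"
}

for file in *.fasta; do
    [ -e "$file" ] || continue   # skip the literal pattern if no file matched
    clean_headers "$file"
done
```

Writing to a temporary file first means the original is only overwritten once awk has finished successfully.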