我需要一个使用blast信息读取文件的正则表达式。 该文件看起来像:
****ALIGNMENT****
Sequence: gi|516137619|ref|WP_017568199.1| hypothetical protein [Nocardiopsis synnemataformans]
length: 136
E_value: 8.9548e-11
score: 153.0
bit_score: 63.5438
identities: 35
positives: 42
gaps: 6
align_length: 70
query: MIRIHPASRDPQTLLDPENWRSAAWNGAPIRDCRGCIDCCDDDWNRSEPEWRRCYGEHLAEDVRHGVAVC...
match: MIRI A+RD LLDP NW S W+ A R CRGC DC + +CYGE + +DVRHGV+VC...
sbjct: MIRIDRANRDHAELLDPANWLSFHWSNAT-RACRGCDDC-----GGTTETLVQCYGEGVVDDVRHGVSVC...
我已经有了代码,但是在这个文件中有一些额外的数据。在此示例中具有相应名称的变量名称为:
hitsid = 516137619
protein = hypothetical protein
organism = Nocardiopsis synnemataformans
length = 136
evalue = 8.9548e-11
score = 153.0
bitscore = 63.5438
identities = 35
positives = 42
gaps = 6
query = MIRIHPASRDPQTLLDPENWRSAAWNGAPIRDCRGCIDCCDDDWNRSEPEWRRCYGEHLAEDVRHGVAVC...
match = MIRI A+RD LLDP NW S W+ A R CRGC DC + +CYGE + +DVRHGV+VC...
subject = MIRIDRANRDHAELLDPANWLSFHWSNAT-RACRGCDDC-----GGTTETLVQCYGEGVVDDVRHGVSVC...
我正在寻找类似这样的东西,这是我已经拥有的正则表达式,但现在又添加了一些东西:
p = re.compile(r'^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)', re.MULTILINE)
文件看起来像:
****ALIGNMENT****
Sequence: gi|516137619|ref|WP_017568199.1| hypothetical protein [Nocardiopsis synnemataformans]
length: 136
E_value: 8.9548e-11
score: 153.0
bit_score: 63.5438
identities: 35
positives: 42
gaps: 6
align_length: 70
query: MIRIHPASRDPQTLLDPENWRSAAWNGAPIRDCRGCIDCCDDDWNRSEPEWRRCYGEHLAEDVRHGVAVC...
match: MIRI A+RD LLDP NW S W+ A R CRGC DC + +CYGE + +DVRHGV+VC...
sbjct: MIRIDRANRDHAELLDPANWLSFHWSNAT-RACRGCDDC-----GGTTETLVQCYGEGVVDDVRHGVSVC...
****ALIGNMENT****
Sequence: gi|962700925|ref|BC_420072443.1| Protein crossbronx-like [Nocardiopsis synnemataformans]
length: 136
E_value: 8.9548e-11
score: 153.0
bit_score: 63.5438
identities: 35
positives: 42
gaps: 6
align_length: 70
query: MIRIHPASRDPQTLLDPENWRSAAWNGAPIRDCRGCIDCCDDDWNRSEPEWRRCYGEHLAEDVRHGVAVC...
match: MIRI A+RD LLDP NW S W+ A R CRGC DC + +CYGE + +DVRHGV+VC...
sbjct: MIRIDRANRDHAELLDPANWLSFHWSNAT-RACRGCDDC-----GGTTETLVQCYGEGVVDDVRHGVSVC...
****ALIGNMENT****
Sequence: gi|516137619|ref|WP_017568199.1| hypothetical protein [Nocardiopsis synnemataformans]
length: 136
E_value: 8.9548e-11
score: 153.0
bit_score: 63.5438
identities: 35
positives: 42
gaps: 6
align_length: 70
query: MIRIHPASRDPQTLLDPENWRSAAWNGAPIRDCRGCIDCCDDDWNRSEPEWRRCYGEHLAEDVRHGVAVC...
match: MIRI A+RD LLDP NW S W+ A R CRGC DC + +CYGE + +DVRHGV+VC...
sbjct: MIRIDRANRDHAELLDPANWLSFHWSNAT-RACRGCDDC-----GGTTETLVQCYGEGVVDDVRHGVSVC...
答案 0 :(得分:1)
你不需要正则表达式:
parsed = []
raw_parts = open('tmp9.txt','r').read().split('****ALIGNMENT****')
for raw_part in raw_parts:
parsed_dict = {}
for line in raw_part.split('\n'):
try:
key,value = line.split(':')
parsed_dict[key] = value.strip()
except:
pass
parsed.append(parsed_dict)
print(parsed)