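This is the first time I've decided to play with classes in Python, and I think this is an example of a task that would actually benefit from using them.
I'm trying to parse an output text file from a bioinformatics tool, but I'm not sure what the best approach is.
The file contains several different "data sections", and I'd like to collect all of them. The sections I'm most interested in are delimited by a >> string (i.e. the file contains >> result 1 >> result 2 >> etc...).
I'm at a bit of a loss as to the best way to start separating the chunks I want without opening and reading through the file over and over (especially as the files can be quite large). Here is an example of the file: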
ClusterBlast scores for /home/wms_joe/PVCs/other_genomes/multigene/operon_genbank/PVCcif_ATCC43949.gbk
Table of genes, locations, strands and annotations of query cluster:
PAU_01961 8 457 + T4-like_virus_tail_tube_protein_gp19 no_locus_tag
PAU_01962 521 1618 + major_tail_sheath_protein no_locus_tag
PAU_01963 1799 3280 + tail_sheath_protein no_locus_tag
PAU_01964 3334 4533 + tail_sheath_protein no_locus_tag
PAU_01965 4547 5005 + T4-like_virus_tail_tube_protein_gp19 no_locus_tag
PAU_01966 5002 5181 + hypothetical_protein no_locus_tag
PAU_01967 5168 5851 + hypothetical_protein no_locus_tag
PAU_01968 5848 7449 + Rhs_element_Vgr_protein no_locus_tag
PAU_01969 7462 7905 + baseplate_wedge_subunit no_locus_tag
PAU_01970 7902 8318 + hypothetical_protein no_locus_tag
PAU_01971 8527 11253 + hypothetical_protein no_locus_tag
PAU_01972 11246 14197 + hypothetical_protein no_locus_tag
PAU_01973 14333 15184 + hypothetical_protein no_locus_tag
PAU_01974 15247 17145 + hypothetical_protein no_locus_tag
PAU_01975 17155 19227 + ATP-dependent_zinc_metalloprotease_FtsH no_locus_tag
PAU_01976 19252 20166 + hypothetical_protein no_locus_tag
PAU_01977 20327 21223 + hypothetical_protein no_locus_tag
PAU_01978 21308 22291 + hypothetical_protein no_locus_tag
PAU_01979 22788 23684 + hypothetical_protein no_locus_tag
PAU_01980 23656 24114 + hypothetical_protein no_locus_tag
Significant hits:
1. PAU_1 Photorhabdus asymbiotica strain ATCC43949.
2. PAB_1 Photorhabdus asymbiotica strain Beaudesert.
3. PAN_5 Photorhabdus asymbiotica strain Nepal.
4. PAT_0 Photorhabdus asymbiotica strain Thai.
Details:
>>
1. PAU_1
Source: Photorhabdus asymbiotica strain ATCC43949.
Number of proteins with BLAST hits to this cluster: 31
MultiGeneBlast score: 31.7
Cumulative Blast bit score: 64022
Table of genes, locations, strands and annotations of subject cluster:
PAU_01961 2233799 2234248 + T4-like_virus_tail_tube_protein_gp19 no_locus_tag
PAU_01962 2234312 2235409 + major_tail_sheath_protein no_locus_tag
PAU_01963 2235590 2237071 + tail_sheath_protein no_locus_tag
PAU_01964 2237125 2238324 + tail_sheath_protein no_locus_tag
PAU_01965 2238338 2238796 + T4-like_virus_tail_tube_protein_gp19 no_locus_tag
PAU_01966 2238793 2238972 + hypothetical_protein no_locus_tag
PAU_01967 2238959 2239642 + hypothetical_protein no_locus_tag
PAU_01968 2239639 2241240 + Rhs_element_Vgr_protein no_locus_tag
PAU_01969 2241253 2241696 + baseplate_wedge_subunit no_locus_tag
PAU_01970 2241693 2242109 + hypothetical_protein no_locus_tag
PAU_01971 2242318 2245044 + hypothetical_protein no_locus_tag
PAU_01972 2245037 2247988 + hypothetical_protein no_locus_tag
PAU_01973 2248124 2248975 + hypothetical_protein no_locus_tag
PAU_01974 2249038 2250936 + hypothetical_protein no_locus_tag
PAU_01976 2253043 2253957 + hypothetical_protein no_locus_tag
PAU_01977 2254118 2255014 + hypothetical_protein no_locus_tag
PAU_01978 2255099 2256082 + hypothetical_protein no_locus_tag
PAU_01979 2256579 2257475 + hypothetical_protein no_locus_tag
PAU_01980 2257447 2257905 + hypothetical_protein no_locus_tag
Table of Blast hits (query gene, subject gene, %identity, blast score, %coverage, e-value):
PAU_01961 PAU_01961 100 309 100.0 2e-108
PAU_01962 PAU_01962 100 749 100.0 0.0
PAU_01963 PAU_01963 100 1015 100.0 0.0
PAU_01964 PAU_01964 100 821 100.0 0.0
PAU_01965 PAU_01965 100 312 100.0 2e-109
PAU_01966 PAU_01966 100 117 100.0 1e-35
PAU_01967 PAU_01967 100 471 100.0 1e-169
PAU_01968 PAU_01968 100 1095 100.0 0.0
PAU_01969 PAU_01969 100 298 100.0 5e-104
PAU_01970 PAU_01970 100 277 100.0 3e-96
PAU_01971 PAU_01971 100 1866 100.0 0.0
PAU_01972 PAU_01972 100 2034 100.0 0.0
PAU_01973 PAU_01973 100 583 100.0 0.0
PAU_01974 PAU_01974 100 1273 100.0 0.0
PAU_01976 PAU_01976 100 614 100.0 0.0
PAU_01977 PAU_01977 100 604 100.0 0.0
PAU_01978 PAU_01978 100 676 100.0 0.0
PAU_01979 PAU_01979 100 608 100.0 0.0
PAU_01980 PAU_01980 100 300 100.0 1e-104
>>
2. PAB_1
Source: Photorhabdus asymbiotica strain Beaudesert.
Number of proteins with BLAST hits to this cluster: 31
MultiGeneBlast score: 31.7
Cumulative Blast bit score: 62512
Table of genes, locations, strands and annotations of subject cluster:
cmlA 528550 528990 + Chloramphenicol_acetyltransferase_2 no_locus_tag
PAB_00496 530556 531005 + T4-like_virus_tail_tube_protein_gp19 no_locus_tag
PAB_00497 531069 532169 + major_tail_sheath_protein no_locus_tag
PAB_00498 532217 533698 + tail_sheath_protein no_locus_tag
PAB_00499 533752 534951 + tail_sheath_protein no_locus_tag
PAB_00500 534965 535423 + T4-like_virus_tail_tube_protein_gp19 no_locus_tag
PAB_00501 535420 535599 + hypothetical_protein no_locus_tag
PAB_00502 535586 536269 + hypothetical_protein no_locus_tag
PAB_00503 536266 537867 + Rhs_element_Vgr_protein no_locus_tag
PAB_00504 537880 538323 + baseplate_wedge_subunit no_locus_tag
PAB_00505 538320 538733 + hypothetical_protein no_locus_tag
PAB_00506 538802 541537 + hypothetical_protein no_locus_tag
PAB_00507 541530 544499 + hypothetical_protein no_locus_tag
PAB_00508 544638 545486 + hypothetical_protein no_locus_tag
PAB_00509 545549 547444 + hypothetical_protein no_locus_tag
ftsH_1 547454 549538 + ATP-dependent_zinc_metalloprotease_FtsH no_locus_tag
PAB_00511 549563 550462 + hypothetical_protein no_locus_tag
PAB_00512 550627 551526 + hypothetical_protein no_locus_tag
PAB_00516 554417 555034 + hypothetical_protein no_locus_tag
PAB_00517 555006 555464 + hypothetical_protein no_locus_tag
dsdX_1 555806 557191 - DsdX_permease no_locus_tag
srlR_1 558367 559143 - Glucitol_operon_repressor no_locus_tag
ygbM 559155 559934 - Putative_hydroxypyruvate_isomerase_YgbM no_locus_tag
fucA_1 560079 560714 - L-fuculose_phosphate_aldolase no_locus_tag
Table of Blast hits (query gene, subject gene, %identity, blast score, %coverage, e-value):
PAU_01961 PAB_00496 99 308 100.0 5e-108
PAU_01962 PAB_00497 91 692 100.273972603 0.0
PAU_01963 PAB_00498 84 803 101.419878296 0.0
PAU_01964 PAB_00499 92 770 100.0 0.0
PAU_01965 PAB_00500 96 302 99.3421052632 2e-105
PAU_01966 PAB_00501 98 115 100.0 6e-35
PAU_01967 PAB_00502 95 453 100.0 2e-162
PAU_01968 PAB_00503 94 1039 100.0 0.0
PAU_01969 PAB_00504 93 280 100.0 1e-96
PAU_01970 PAB_00505 89 248 100.0 2e-84
PAU_01971 PAB_00506 84 1572 100.330396476 0.0
PAU_01972 PAB_00507 84 1706 100.915564598 0.0
PAU_01973 PAB_00508 72 410 100.706713781 9e-144
PAU_01974 PAB_00509 79 992 100.316455696 0.0
PAU_01975 ftsH_1 77 1095 100.579710145 0.0
PAU_01976 PAB_00511 83 507 98.6842105263 0.0
PAU_01977 PAB_00512 87 533 100.0 0.0
PAU_01979 PAB_00516 95 396 68.7919463087 3e-139
PAU_01980 PAB_00517 96 291 100.0 4e-101
>>
3. PAN_5
Source: Photorhabdus asymbiotica strain Nepal.
Number of proteins with BLAST hits to this cluster: 29
MultiGeneBlast score: 29.2
Cumulative Blast bit score: 61588
Table of genes, locations, strands and annotations of subject cluster:
tdiR 3118573 3119202 + Transcriptional_regulatory_protein_TdiR no_locus_tag
xerD_3 3123014 3123238 - Tyrosine_recombinase_XerD no_locus_tag
PAN_02769 3124026 3124475 + T4-like_virus_tail_tube_protein_gp19 no_locus_tag
PAN_02770 3124539 3125639 + major_tail_sheath_protein no_locus_tag
PAN_02771 3125687 3127156 + tail_sheath_protein no_locus_tag
PAN_02772 3127210 3128409 + tail_sheath_protein no_locus_tag
PAN_02773 3128423 3128881 + T4-like_virus_tail_tube_protein_gp19 no_locus_tag
PAN_02774 3128878 3129057 + hypothetical_protein no_locus_tag
PAN_02775 3129044 3129727 + hypothetical_protein no_locus_tag
PAN_02776 3129724 3131325 + Rhs_element_Vgr_protein no_locus_tag
PAN_02777 3131338 3131781 + baseplate_wedge_subunit no_locus_tag
PAN_02778 3131778 3132191 + hypothetical_protein no_locus_tag
PAN_02779 3132260 3134998 + hypothetical_protein no_locus_tag
PAN_02780 3134991 3137954 + hypothetical_protein no_locus_tag
PAN_02781 3138093 3138938 + hypothetical_protein no_locus_tag
PAN_02782 3139001 3140908 + hypothetical_protein no_locus_tag
PAN_02784 3143027 3143926 + hypothetical_protein no_locus_tag
PAN_02785 3144090 3144989 + hypothetical_protein no_locus_tag
PAN_02789 3146760 3147656 + hypothetical_protein no_locus_tag
PAN_02790 3147628 3148086 + hypothetical_protein no_locus_tag
hcpA_11 3148136 3148615 - Secreted_protein_hcp no_locus_tag
ygbN 3148802 3150187 - Inner_membrane_permease_YgbN no_locus_tag
Table of Blast hits (query gene, subject gene, %identity, blast score, %coverage, e-value):
PAU_01961 PAN_02769 99 308 100.0 5e-108
PAU_01962 PAN_02770 91 691 100.273972603 0.0
PAU_01963 PAN_02771 85 810 100.60851927 0.0
PAU_01964 PAN_02772 92 769 100.0 0.0
PAU_01965 PAN_02773 97 305 99.3421052632 1e-106
PAU_01966 PAN_02774 98 115 100.0 6e-35
PAU_01967 PAN_02775 95 452 100.0 3e-162
PAU_01968 PAN_02776 94 1040 100.0 0.0
PAU_01969 PAN_02777 93 280 100.0 1e-96
PAU_01970 PAN_02778 90 251 100.0 9e-86
PAU_01971 PAN_02779 84 1571 100.660792952 0.0
PAU_01972 PAN_02780 84 1701 100.915564598 0.0
PAU_01973 PAN_02781 70 407 100.706713781 1e-142
PAU_01974 PAN_02782 79 981 100.949367089 0.0
PAU_01976 PAN_02784 82 503 98.6842105263 1e-179
PAU_01977 PAN_02785 87 536 100.0 0.0
PAU_01979 PAN_02789 94 572 100.0 0.0
PAU_01980 PAN_02790 97 296 100.0 5e-103
>>
4. PAT_0
Source: Photorhabdus asymbiotica strain Thai.
Number of proteins with BLAST hits to this cluster: 29
MultiGeneBlast score: 29.2
Cumulative Blast bit score: 61577
Table of genes, locations, strands and annotations of subject cluster:
PAT_00132 127877 128326 + T4-like_virus_tail_tube_protein_gp19 no_locus_tag
PAT_00133 128390 129490 + major_tail_sheath_protein no_locus_tag
PAT_00134 129538 131007 + tail_sheath_protein no_locus_tag
PAT_00135 131061 132260 + tail_sheath_protein no_locus_tag
PAT_00136 132274 132732 + T4-like_virus_tail_tube_protein_gp19 no_locus_tag
PAT_00137 132729 132908 + hypothetical_protein no_locus_tag
PAT_00138 132895 133578 + hypothetical_protein no_locus_tag
PAT_00139 133575 135176 + Rhs_element_Vgr_protein no_locus_tag
PAT_00140 135189 135632 + baseplate_wedge_subunit no_locus_tag
PAT_00141 135629 136042 + hypothetical_protein no_locus_tag
PAT_00142 136111 138885 + hypothetical_protein no_locus_tag
PAT_00143 138878 141841 + hypothetical_protein no_locus_tag
PAT_00144 141980 142825 + hypothetical_protein no_locus_tag
PAT_00145 142888 144795 + hypothetical_protein no_locus_tag
PAT_00147 146914 147813 + hypothetical_protein no_locus_tag
PAT_00148 147977 148876 + hypothetical_protein no_locus_tag
PAT_00152 150647 151543 + hypothetical_protein no_locus_tag
PAT_00153 151515 151973 + hypothetical_protein no_locus_tag
hcpA_1 152023 152502 - Secreted_protein_hcp no_locus_tag
Table of Blast hits (query gene, subject gene, %identity, blast score, %coverage, e-value):
PAU_01961 PAT_00132 99 308 100.0 5e-108
PAU_01962 PAT_00133 91 693 100.273972603 0.0
PAU_01963 PAT_00134 85 810 100.60851927 0.0
PAU_01964 PAT_00135 92 769 100.0 0.0
PAU_01965 PAT_00136 97 305 99.3421052632 1e-106
PAU_01966 PAT_00137 98 115 100.0 6e-35
PAU_01967 PAT_00138 95 452 100.0 3e-162
PAU_01968 PAT_00139 94 1039 100.0 0.0
PAU_01969 PAT_00140 93 280 100.0 1e-96
PAU_01970 PAT_00141 89 249 100.0 5e-85
PAU_01971 PAT_00142 83 1568 101.982378855 0.0
PAU_01972 PAT_00143 84 1701 100.915564598 0.0
PAU_01973 PAT_00144 70 407 100.706713781 1e-142
PAU_01974 PAT_00145 78 974 100.949367089 0.0
PAU_01976 PAT_00147 82 503 98.6842105263 1e-179
PAU_01977 PAT_00148 87 536 100.0 0.0
PAU_01979 PAT_00152 94 572 100.0 0.0
PAU_01980 PAT_00153 97 296 100.0 5e-103
The file carries on in the same manner as the last few results, each sitting between the >> delimiters I mentioned.
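Given the layout above, one way to separate the blocks in a single pass, without re-reading the file or holding it all in memory, is a generator that walks the lines once. This is only a minimal sketch: the name iter_blocks is made up for illustration, and it assumes each >> delimiter sits alone on its own line, as it does in the sample.

```python
from typing import Iterable, Iterator, List


def iter_blocks(lines: Iterable[str], delim: str = ">>") -> Iterator[List[str]]:
    """Yield lists of non-blank lines, split wherever a line equals `delim`.

    The first list yielded is everything before the first delimiter
    (the header and "Significant hits" parts in the sample file).
    """
    block: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == delim:
            # A delimiter line closes the current block.
            yield block
            block = []
        elif line.strip():
            # Keep non-blank lines only.
            block.append(line)
    if block:
        # Whatever follows the last delimiter is the final block.
        yield block


sample = ["header\n", "Details:\n", "\n", ">>\n",
          "1. PAU_1\n", "Source: x\n", ">>\n", "2. PAB_1\n"]
blocks = list(iter_blocks(sample))
```

Because an open file handle is itself an iterable of lines, it can be passed straight in (inside a with block), so only one block is held in memory at a time.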
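The first two parts of the file (ClusterBlast scores... and Significant hits) only occur once, but for each result denoted by >> I want to collect the section as a separate class instance with various attributes and functions, which I'll write once I've worked out how to separate the file segments.
Each result I want to collect would look something like: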
class MGB_hit(object):
    """
    Store each MGB hit as a class object so as to group all the attributes.

    Attributes:
        hit_no: The rank number of the hit returned from MultiGeneBlast.
        name: The name assigned to the hit rank.
        source: The extended description of the hit/its sequence origin.
        protein_no: The number of proteins with hits in the detected cluster.
        MGB_score: The weighted MGB score used to rank the hits with synteny etc.
        cubit_score: The cumulative bit-score of all the BLAST hits within the cluster.
    """

    def __init__(self, hit_no, name, source, protein_no, MGB_score, cubit_score):
        """Initialise a MGB hit object"""
        # Store the arguments rather than overwriting them with defaults.
        self.hit_no = hit_no
        self.name = name
        self.source = source
        self.protein_no = protein_no
        self.MGB_score = MGB_score
        self.cubit_score = cubit_score
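This means populating a list of these class instances from the input file (and also handling the first two segments, perhaps as classes of their own?).
I've managed to get some of the way there with the following: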
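As a rough illustration of the populating step, the summary lines at the top of each >> block could be turned into keyword arguments for the class. This is a sketch under assumptions: parse_hit_header is a hypothetical helper, the label strings are copied from the sample file, and it expects the block as a list of non-blank lines whose first entry looks like "1. PAU_1".

```python
def parse_hit_header(lines):
    """Extract the summary fields from the lines of one `>>` block."""
    # The first line looks like "1. PAU_1": rank, then name.
    rank, name = lines[0].split(". ", 1)
    fields = {"hit_no": int(rank), "name": name.strip()}

    # The remaining summary lines are "Label: value" pairs.
    # Map each label to (attribute name, type conversion).
    labels = {
        "Source": ("source", str),
        "Number of proteins with BLAST hits to this cluster": ("protein_no", int),
        "MultiGeneBlast score": ("MGB_score", float),
        "Cumulative Blast bit score": ("cubit_score", int),
    }
    for line in lines[1:]:
        label, sep, value = line.partition(": ")
        if sep and label in labels:
            key, cast = labels[label]
            fields[key] = cast(value.strip())
    return fields


block = [
    "1. PAU_1",
    "Source: Photorhabdus asymbiotica strain ATCC43949.",
    "Number of proteins with BLAST hits to this cluster: 31",
    "MultiGeneBlast score: 31.7",
    "Cumulative Blast bit score: 64022",
    "Table of genes, locations, strands and annotations of subject cluster:",
]
fields = parse_hit_header(block)
```

The keys match the parameter names of MGB_hit.__init__, so MGB_hit(**fields) would build an instance, provided __init__ actually stores its arguments on self. The gene and BLAST hit tables are not handled here.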
import re


def parse_section(file, delim1, delim2):
    """Separate files in to sections according to delimiter pairs"""
    regex = '{}(.*?){}'.format(delim1, delim2)
    for result in re.findall(regex, file, re.S):
        result = filter(None, result.split('\n'))
        return result


def main():
    """Call functions and parse results."""
    # get_args() is defined elsewhere (not shown).
    args = get_args()
    header_section = []
    sighits_section = []
    details_section = []
    with open(args.clusterfile, 'r') as cfh:
        content = cfh.read()
        header_section.append(parse_section(content, "^", "Significant hits"))
        sighits_section.append(parse_section(content, "Significant hits", ">>"))
        details_section.append(parse_section(content, ">>", ">>"))


if __name__ == "__main__":
    main()
However, this only gives me the first match starting at the >> delimiter. Any suggestions for capturing the others?
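Two things in parse_section would each explain the missing matches. If the return sits inside the for loop, as the symptom suggests, the function exits after the first result. Separately, the pattern >>(.*?)>> consumes the closing >>, so it is no longer available to open the next match; re.findall returns non-overlapping matches, so every other block would be skipped even with the return moved. One possible alternative, sketched here with a made-up name (split_sections) and assuming each >> sits alone on its own line, is to split on the delimiter instead of matching pairs of it:

```python
import re


def split_sections(text, delim=">>"):
    """Return every non-empty chunk of `text` between delimiter lines.

    Each chunk is returned as a list of its non-blank lines.
    """
    # Match the delimiter only when it is alone on a line.
    pattern = r"^{}[ \t]*$".format(re.escape(delim))
    chunks = re.split(pattern, text, flags=re.M)
    return [
        [line for line in chunk.split("\n") if line.strip()]
        for chunk in chunks
        if chunk.strip()
    ]


text = "head\nDetails:\n>>\n1. A\nx\n>>\n2. B\ny\n>>\n3. C\nz\n"
sections = split_sections(text)
```

The first element is everything before the first >> (the header and Significant hits parts), and the rest are the individual results. The list comprehension also sidesteps the fact that filter returns a lazy iterator rather than a list on Python 3.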