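I have a file to parse and I don't know what the best strategy for building the regular expression is. I want to get the lines that contain the data. (I have already extracted the values I need from each line, but I realized I was missing some matches because my first regular expression was not good enough.)
Here are some of the regular expressions/strategies I have tried:
Find the header and match everything that follows, up to the two blank lines: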
data_regex = re.compile("(?<= ------- ------ ----- ------- ------ ----- ---- -- -------- -----------\n)[^(\n)^(\n)^]+")
What it matched:
1.3e-26 92.9 13.7 4.3e-26 91.2 8.9 2.0 2 BPD_transp_1 Binding-protein-dependent transport system inne
4.7e-34 117.1 19.5 9e-34 116.2 13.5 1.4 1 BPD_transp_1 Binding-protein-dependent transport system inne
3.2e-153 509.4 5.2 3.6e-153 509.2 3.6 1.0 1 IMPDH IMP dehydrogenase / GMP reductase domain
1.3e-20 73.2 0.2 3.4e-19 68.6 0.1 2.5 3 DEAD DEAD/DEAH box helicase
6.9e-11 42.1 0.0 1.8e-09 37.5 0.0 2.4 2 CTP_transf_2 Cytidylyltransferase
As you can see, it matches some of the data, but not everything I expected. So I tried another one:
data_regex = re.compile("(?<= E-value score bias E-value score bias exp N Model Description\s)(.+\s)+")
With this expression I expected to get more than I need, including the --- lines, but I ended up with this:
3.6 7.2 11.6 0.13 11.9 3.6 2.0 2 Spore_YabQ Spore cortex protein YabQ (Spore_YabQ)
0.63 9.6 3.1 0.42 10.2 0.3 2.1 2 IBV_3C IBV 3C protein
0.38 9.6 4.8 0.65 8.9 0.8 2.6 3 PcrB PcrB family
0.059 12.6 0.3 1 8.6 0.0 2.8 3 DUF699 Putative ATPase (DUF699)
0.028 14.1 0.9 14 5.7 0.0 3.8 4 HEAT HEAT repeat
Again, some results, but not what I expected.
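A likely explanation for getting a single row per table here: re.findall returns the contents of the capture group when the pattern has one, and a repeated group such as (.+\s)+ keeps only its last repetition. The sketch below reproduces this on a shortened, made-up table (the three-column header is an assumption for brevity, not the real file layout) and shows that a non-capturing group returns the whole block instead.

```python
import re

# Shortened, hypothetical table; the real file has ten columns.
text = (
    "E-value score Model\n"
    "------- ----- -----\n"
    "3e-24 85.3 ABC_tran\n"
    "0.016 14.5 Arch_ATPase\n"
    "0.23 11.3 RNA_helicase\n"
    "\n"
    "Domain annotation\n"
)

# Same shape as the second attempt: a repeated *capturing* group.
capturing = re.compile(r"(?<=E-value score Model\s)(.+\s)+")
last_only = capturing.findall(text)  # holds only the final repetition

# Non-capturing group: the full match (every line up to the blank line) is kept.
non_capturing = re.compile(r"(?<=E-value score Model\s)(?:.+\s)+")
block = non_capturing.search(text).group(0)
rows = block.splitlines()

print(last_only)
print(rows)
```

With the non-capturing version the dashed separator line is still part of the block, so it would have to be filtered out afterwards.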
Find a whitespace-separated number repeated several times, followed by words:
data_regex = re.compile("(\s+([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s+)(\w+\s)+")
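But it found lots of single numbers followed by a word, rather than the run of space-separated numbers and then the words that I wanted: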
(' 2010 ', '2010', 'Medical ')
(' 1 ', '1', 'domain ')
(' 1.5 ', '1.5', '1 ')
(' 6.2e-27 ', '6.2e-27', '12 ')
(' 17 ', '17', '129 ')
(' 7 ', '7', '130 ')
(' 0.92\n\n ', '0.92', 'each ')
(' 5.2e-31\n ', '5.2e-31', 'PucR ')
I use this to get the matches:
data_result = re.findall(data_regex, document)
print(data_result)
The kind of data I am parsing, an excerpt from the file:
# CPU time: 0.66u 0.50s 00:00:01.16 Elapsed: 00:00:00.55
# Mc/sec: 902.81
//
Query: LD_216 [L=247]
Description: # 237337 # 238077 # 1 # ID=1_216;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.390
Scores for complete sequence (score includes all domains):
--- full sequence --- --- best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Model Description
------- ------ ----- ------- ------ ----- ---- -- -------- -----------
3e-24 85.3 0.0 5.2e-24 84.5 0.0 1.4 1 ABC_tran ABC transporter
3.2e-11 42.5 0.3 9.7e-11 40.9 0.2 1.7 1 SMC_N RecF/RecN/SMC N terminal domain
3.1e-05 22.4 0.1 0.17 10.1 0.0 2.6 2 ABC_ATPase Predicted ATPase of the ABC class
6.5e-05 21.8 0.1 0.0001 21.2 0.0 1.3 1 DUF258 Protein of unknown function, DUF258
0.001 19.0 0.5 0.21 11.5 0.0 2.2 2 AAA ATPase family associated with various cellular
0.0019 16.4 0.1 0.0046 15.1 0.0 1.6 2 DLIC Dynein light intermediate chain (DLIC)
0.0032 15.8 0.1 0.028 12.7 0.0 2.0 2 Adeno_IVa2 Adenovirus IVa2 protein
------ inclusion threshold ------
0.016 14.5 0.3 0.037 13.4 0.2 1.8 1 Arch_ATPase Archaeal ATPase
0.018 14.3 0.0 0.046 13.0 0.0 1.6 1 UPF0079 Uncharacterised P-loop hydrolase UPF0079
0.02 13.3 0.2 0.041 12.3 0.1 1.4 1 Rad17 Rad17 cell cycle checkpoint protein
0.026 13.7 0.1 0.049 12.8 0.0 1.4 1 PduV-EutP Ethanolamine utilisation - propanediol utilisat
0.046 12.2 0.0 0.085 11.4 0.0 1.5 1 GSPII_E Type II/IV secretion system protein
0.05 12.4 0.0 0.087 11.6 0.0 1.4 1 Mg_chelatase Magnesium chelatase, subunit ChlI
0.054 12.0 0.2 0.094 11.2 0.2 1.7 1 NB-ARC NB-ARC domain
0.056 12.9 0.1 0.15 11.5 0.1 1.8 1 MobB Molybdopterin guanine dinucleotide synthesis pr
0.059 12.0 0.4 8.9 4.8 0.0 2.4 2 KAP_NTPase KAP family P-loop domain
0.079 12.3 0.3 0.57 9.5 0.1 2.1 2 AAA_5 AAA domain (dynein-related subfamily)
0.086 11.9 0.2 0.32 10.0 0.0 2.0 2 IstB IstB-like ATP binding protein
0.13 11.0 1.6 3.5 6.3 0.1 2.7 3 KaiC KaiC
0.23 11.3 1.3 0.92 9.4 0.1 2.7 4 RNA_helicase RNA helicase
Domain annotation for each model (and alignments):
>> ABC_tran ABC transporter
# Here begins other type of data but above there are two empty lines
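The "------ inclusion threshold ------" line can appear right after the "------- ------ ----- ------- ------ ----- ---- -- -------- -----------" line or at any position among the data rows. If possible, I would like to know where each matched row sits relative to it, because rows that fall under the inclusion threshold need to be handled differently.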
How can I get these lines from the file?
Expected output:
3e-24 85.3 0.0 5.2e-24 84.5 0.0 1.4 1 ABC_tran ABC transporter
3.2e-11 42.5 0.3 9.7e-11 40.9 0.2 1.7 1 SMC_N RecF/RecN/SMC N terminal domain
3.1e-05 22.4 0.1 0.17 10.1 0.0 2.6 2 ABC_ATPase Predicted ATPase of the ABC class
6.5e-05 21.8 0.1 0.0001 21.2 0.0 1.3 1 DUF258 Protein of unknown function, DUF258
0.001 19.0 0.5 0.21 11.5 0.0 2.2 2 AAA ATPase family associated with various cellular
0.0019 16.4 0.1 0.0046 15.1 0.0 1.6 2 DLIC Dynein light intermediate chain (DLIC)
0.0032 15.8 0.1 0.028 12.7 0.0 2.0 2 Adeno_IVa2 Adenovirus IVa2 protein
0.016 14.5 0.3 0.037 13.4 0.2 1.8 1 Arch_ATPase Archaeal ATPase
0.018 14.3 0.0 0.046 13.0 0.0 1.6 1 UPF0079 Uncharacterised P-loop hydrolase UPF0079
0.02 13.3 0.2 0.041 12.3 0.1 1.4 1 Rad17 Rad17 cell cycle checkpoint protein
0.026 13.7 0.1 0.049 12.8 0.0 1.4 1 PduV-EutP Ethanolamine utilisation - propanediol utilisat
0.046 12.2 0.0 0.085 11.4 0.0 1.5 1 GSPII_E Type II/IV secretion system protein
0.05 12.4 0.0 0.087 11.6 0.0 1.4 1 Mg_chelatase Magnesium chelatase, subunit ChlI
0.054 12.0 0.2 0.094 11.2 0.2 1.7 1 NB-ARC NB-ARC domain
0.056 12.9 0.1 0.15 11.5 0.1 1.8 1 MobB Molybdopterin guanine dinucleotide synthesis pr
0.059 12.0 0.4 8.9 4.8 0.0 2.4 2 KAP_NTPase KAP family P-loop domain
0.079 12.3 0.3 0.57 9.5 0.1 2.1 2 AAA_5 AAA domain (dynein-related subfamily)
0.086 11.9 0.2 0.32 10.0 0.0 2.0 2 IstB IstB-like ATP binding protein
0.13 11.0 1.6 3.5 6.3 0.1 2.7 3 KaiC KaiC
0.23 11.3 1.3 0.92 9.4 0.1 2.7 4 RNA_helicase RNA helicase
Edit
In the end I switched to reading the file with readlines() and then doing the following for each line:
elif lines.startswith(" "):
    data_regex = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")  # Matches numbers
    data_result = re.findall(data_regex, lines)
    data_regex2 = re.compile(r"[?!]")  # Some other characters found
    data_result2 = re.findall(data_regex2, lines)
    data_regex3 = re.compile(r"-{2,}")  # Finds where the ----- lines are
    data_result3 = re.findall(data_regex3, lines)
    # There are numbers in the line, there are 10 or more words and numbers (8 numbers
    # plus id and description), it doesn't have any "strange" character, and it is
    # not a --- line
    if data_result != [] and len(lines.split()) >= 10 and data_result2 == [] and data_result3 == []:
        print(lines[:-1])
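For the inclusion-threshold part of the question, the per-line approach can be extended with a small amount of state. This is only a sketch under assumptions: the function name iter_hits is made up, the header fields are copied from the excerpt, the leading whitespace in the sample is guessed, and a table is assumed to end at the first blank line.

```python
# Column names copied from the header line in the excerpt.
HEADER = ["E-value", "score", "bias", "E-value", "score", "bias",
          "exp", "N", "Model", "Description"]


def iter_hits(lines):
    """Yield (row, past_threshold) for each data row of every score table."""
    in_table = False
    past_threshold = False
    for line in lines:
        fields = line.split()
        if fields == HEADER:
            in_table = True           # a new table starts, reset the flag
            past_threshold = False
        elif not in_table:
            continue                  # ignore everything outside a table
        elif not fields:
            in_table = False          # a blank line ends the table
        elif "inclusion threshold" in line:
            past_threshold = True     # rows after this marker get flagged
        elif set(line.strip()) <= set("- "):
            continue                  # dashed separator under the header
        else:
            yield line.strip(), past_threshold


sample = """\
Scores for complete sequence (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Model       Description
    ------- ------ -----    ------- ------ -----   ---- --  --------    -----------
      3e-24   85.3   0.0    5.2e-24   84.5   0.0    1.4  1  ABC_tran    ABC transporter
  ------ inclusion threshold ------
      0.016   14.5   0.3      0.037   13.4   0.2    1.8  1  Arch_ATPase Archaeal ATPase


Domain annotation for each model (and alignments):
>> ABC_tran  ABC transporter
"""

hits = list(iter_hits(sample.splitlines()))
for row, past_threshold in hits:
    print(past_threshold, row)
```

Keying on the header line rather than on any dashed line is a deliberate choice here, so that dashed separators in other sections of the file do not start a table by accident.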
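Answer 0 (score: 0)
My suggestion: first strip the header lines, the lines like -----blablabla-----, etc., so that you are left with a file containing only the data columns. Then, if you use numpy and assuming the columns are separated by tabs: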
#!/usr/bin/env python
import numpy as np
dat = np.genfromtxt('data.txt', delimiter='\t', dtype=str)
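dat will then hold all the numbers and words in a two-dimensional array of type str, and dat[:, 0:8] will contain the eight numeric columns (still as strings).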
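If the columns turn out to be separated by runs of spaces rather than tabs (the excerpt in the question looks that way, but that is an assumption), a tab delimiter will not split the rows, and the description itself contains spaces. A plain-Python sketch that limits the number of splits may be enough in that case; split_row is a hypothetical helper name.

```python
def split_row(row):
    """Split one data row into 8 numbers, the model name and the description."""
    # At most 9 splits: 8 numeric fields, the model name, then the
    # description as one piece even though it contains spaces.
    parts = row.split(None, 9)
    numbers = [float(value) for value in parts[:8]]  # N is converted to float too
    model = parts[8]
    description = parts[9].strip() if len(parts) > 9 else ""
    return numbers, model, description


row = "0.001 19.0 0.5 0.21 11.5 0.0 2.2 2 AAA ATPase family associated with various cellular"
numbers, model, description = split_row(row)
print(numbers)
print(model)
print(description)
```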
Answer 1 (score: 0)
After reading the file line by line, I ended up using this regular expression:
data_regex = re.compile("^ {3,10}((-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*){8}.+")
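It checks for enough spaces ({3,10}) at the start of the line (^) to skip other kinds of data, followed by 8 ({8}) numbers (-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?) optionally separated by whitespace (\s*), and then the rest of the line (.+).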