So I have a text table that looks like this:
BLOCK 1. MARKERS: 1 2
42 (0.500) |0.269 0.166 0.041 0.024|
21 (0.351) |0.069 0.119 0.079 0.084|
22 (0.149) |0.054 0.040 0.055 0.000|
Multiallelic Dprime: 0.295
BLOCK 2. MARKERS: 9 10 11 12
1123 (0.392) |0.351 0.037|
2341 (0.324) |0.277 0.043|
2121 (0.176) |0.016 0.164|
1121 (0.108) |0.073 0.036|
Multiallelic Dprime: 0.591
BLOCK 3. MARKERS: 13 14
13 (0.716)
34 (0.284)
For each block, I only need the following information:
BLOCK1:
42 0.500
21 0.351
22 0.149
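I have no problem parsing the individual lines and extracting what I need. A list of lists is probably what I should be aiming for. My problem is that I can't read the exact number of lines in each block without getting an error at the end.
So I wrote this ugly code: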
file = open('haplotypes_hetero.txt')
to_parse = []
for line in file:
    to_parse.append(line.strip())
to_parse_2=[]
for line in to_parse:
    line = line.split()
    to_parse_2.append(line)
for i in range(len(to_parse_2)):
    if to_parse_2[i][0]=='BLOCK':
        z=i
        if z < len(to_parse_2):
            z+=1
        while to_parse_2[z][0] != 'BLOCK':
            print to_parse_2[z][0]
            z+=1
            if z>len(to_parse_2):
                z=0
file.close()
It sort of works and prints what it should, but I get an error at the end:
42
21
22
Multiallelic
1123
2341
2121
1121
Multiallelic
13
34
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
How do I get rid of the IndexError?
Answer 0 (score: 3)
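I think the problem is in these lines:
if z>len(to_parse_2):
    z=0
because the program only resets z when it is greater than the length of the list. When z equals the length of the list it is already one past the last valid index, so to_parse_2[z] raises the IndexError before the check can help. Change those lines to:
if z >= len(to_parse_2):  # changed '>' to '>='
    z=0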
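To spell out the off-by-one: a list of length n has valid indices 0 to n-1, so z == n is already out of range. As an alternative, the sketch below avoids manual index bookkeeping by walking the lines once and starting a new sub-list at every BLOCK header. It is only an illustration, not the asker's code: the function name `group_blocks` and the shortened inline sample are made up for the example, and it assumes every data row starts with the haplotype followed by the frequency in parentheses.

```python
def group_blocks(lines):
    """Return a list of blocks; each block is a list of [haplotype, frequency] pairs."""
    blocks = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == 'BLOCK':
            blocks.append([])  # a header line starts a new block
        elif blocks and fields[0].isdigit():
            # data row such as "42 (0.500) |0.269 ...|"
            blocks[-1].append([fields[0], fields[1].strip('()')])
    return blocks


# Shortened, hypothetical sample in the same layout as the question's table.
sample = """BLOCK 1. MARKERS: 1 2
42 (0.500) |0.269 0.166 0.041 0.024|
21 (0.351) |0.069 0.119 0.079 0.084|
Multiallelic Dprime: 0.295
BLOCK 3. MARKERS: 13 14
13 (0.716)
34 (0.284)"""

for block in group_blocks(sample.splitlines()):
    print(block)
```

Because nothing is indexed by position, there is no index that can run past the end of the list. With the real data, the open file object could be passed in place of `sample.splitlines()`.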
Answer 1 (score: 2)
Sorry, couldn't resist:
>>> import re
>>> s='''BLOCK 1. MARKERS: 1 2
... 42 (0.500) |0.269 0.166 0.041 0.024|
... 21 (0.351) |0.069 0.119 0.079 0.084|
... 22 (0.149) |0.054 0.040 0.055 0.000|
... Multiallelic Dprime: 0.295
... BLOCK 2. MARKERS: 9 10 11 12
... 1123 (0.392) |0.351 0.037|
... 2341 (0.324) |0.277 0.043|
... 2121 (0.176) |0.016 0.164|
... 1121 (0.108) |0.073 0.036|
... Multiallelic Dprime: 0.591
... BLOCK 3. MARKERS: 13 14
... 13 (0.716)
... 34 (0.284)'''
>>> re.findall(r'(?:(\d+)\s+\(([\d.]+)\)|(BLOCK \d+))',s)
[('', '', 'BLOCK 1'), ('42', '0.500', ''), ('21', '0.351', ''), ('22', '0.149', ''), ('', '', 'BLOCK 2'), ('1123', '0.392', ''), ('2341', '0.324', ''), ('2121', '0.176', ''), ('1121', '0.108', ''), ('', '', 'BLOCK 3'), ('13', '0.716', ''), ('34', '0.284', '')]
Also, this:
file = open('haplotypes_hetero.txt')
to_parse = []
for line in file:
    to_parse.append(line.strip())
to_parse_2=[]
for line in to_parse:
    line = line.split()
    to_parse_2.append(line)
can be replaced with:
to_parse_2 = [l.split() for l in open('haplotypes_hetero.txt').readlines()]
I strongly recommend learning Python's list comprehensions.
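The re.findall call in this answer returns one flat list, so the rows still have to be attached to their block. One possible way, sketched here on a shortened copy of the sample data (the variable name `blocks` is just illustrative), is to start a new entry whenever the third group, the BLOCK title, is non-empty:

```python
import re

# Shortened copy of the question's table, for illustration only.
s = '''BLOCK 1. MARKERS: 1 2
42 (0.500) |0.269 0.166 0.041 0.024|
21 (0.351) |0.069 0.119 0.079 0.084|
Multiallelic Dprime: 0.295
BLOCK 2. MARKERS: 9 10 11 12
1123 (0.392) |0.351 0.037|'''

blocks = []
for hap, freq, title in re.findall(r'(?:(\d+)\s+\(([\d.]+)\)|(BLOCK \d+))', s):
    if title:
        # a BLOCK header: open a new (title, rows) entry
        blocks.append((title, []))
    else:
        # a data row: attach it to the most recent block
        blocks[-1][1].append((hap, freq))

print(blocks)
```

This relies on the table always starting with a BLOCK header; a data row before the first header would raise an IndexError on `blocks[-1]`.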
Answer 2 (score: 2)
You could try something like this:
import re

table='''\
BLOCK 1. MARKERS: 1 2
42 (0.500) |0.269 0.166 0.041 0.024|
21 (0.351) |0.069 0.119 0.079 0.084|
22 (0.149) |0.054 0.040 0.055 0.000|
Multiallelic Dprime: 0.295
BLOCK 2. MARKERS: 9 10 11 12
1123 (0.392) |0.351 0.037|
2341 (0.324) |0.277 0.043|
2121 (0.176) |0.016 0.164|
1121 (0.108) |0.073 0.036|
Multiallelic Dprime: 0.591
BLOCK 3. MARKERS: 13 14
13 (0.716)
34 (0.284)'''

d={}
# Each match yields the block title and the text up to the next BLOCK (or the end).
for title, block in re.findall(r'^(BLOCK \d+)\..*?\n(.*?)(?=^BLOCK|\Z)', table, flags=re.M | re.S):
    d[title]=[]
    for line in block.splitlines():
        # "42 (0.500) |...|" -> ('42 ', '(', '0.500')
        t=line.partition(')')[0].partition('(')
        try:
            d[title].append([float(t[0]), float(t[2])])
        except ValueError:
            # lines such as "Multiallelic Dprime: ..." are not data rows
            pass
for k, v in sorted(d.items()):
    print('%s : %s' % (k, v))
Prints:
BLOCK 1 : [[42.0, 0.5], [21.0, 0.351], [22.0, 0.149]]
BLOCK 2 : [[1123.0, 0.392], [2341.0, 0.324], [2121.0, 0.176], [1121.0, 0.108]]
BLOCK 3 : [[13.0, 0.716], [34.0, 0.284]]
Answer 3 (score: 1)
You don't need a complicated approach for this kind of problem; you can use regex:
>>> s="""BLOCK 1. MARKERS: 1 2
... 42 (0.500) |0.269 0.166 0.041 0.024|
... 21 (0.351) |0.069 0.119 0.079 0.084|
... 22 (0.149) |0.054 0.040 0.055 0.000|
... Multiallelic Dprime: 0.295
... BLOCK 2. MARKERS: 9 10 11 12
... 1123 (0.392) |0.351 0.037|
... 2341 (0.324) |0.277 0.043|
... 2121 (0.176) |0.016 0.164|
... 1121 (0.108) |0.073 0.036|
... Multiallelic Dprime: 0.591
... BLOCK 3. MARKERS: 13 14
... 13 (0.716)
... 34 (0.284)"""
>>> import re
>>> l=re.findall(r'((^BLOCK \d+\.)((?!BLOCK).)*)(?=BLOCK|$)',s,re.MULTILINE|re.DOTALL)
>>> [(i[-2],re.findall(r'(\d+)\s+\(([\d.]+)\)',i[0])) for i in l]
[('BLOCK 1.', [('42', '0.500'), ('21', '0.351'), ('22', '0.149')]), ('BLOCK 2.', [('1123', '0.392'), ('2341', '0.324'), ('2121', '0.176'), ('1121', '0.108')]), ('BLOCK 3.', [('13', '0.716'), ('34', '0.284')])]
First, you need to extract the blocks, which you can do with re.findall and the following regex:
>>> l=re.findall(r'((^BLOCK \d+\.)((?!BLOCK).)*)(?=BLOCK|$)',s,re.MULTILINE|re.DOTALL)
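Then you can use r'(\d+)\s+\(([\d.]+)\)' to match a number followed by one or more whitespace characters and then a combination of digits and dots inside parentheses.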
As a side note, ((?!BLOCK).)* matches any string that does not contain the word BLOCK. For more on this, I suggest reading http://www.regular-expressions.info/lookaround.html, which explains look-around in regular expressions.
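To make the look-ahead pattern concrete, here is a small check on made-up strings (not the full haplotype table; the names `pattern`, `full` and `partial` are illustrative). Strictly speaking, ((?!BLOCK).)* consumes one character at a time for as long as the text ahead does not start with BLOCK, so on input that does contain BLOCK it stops right before it rather than failing.

```python
import re

# DOTALL lets '.' cross newlines, as in the answer's findall call.
pattern = re.compile(r'((?!BLOCK).)*', re.DOTALL)

# No "BLOCK" anywhere: the whole string is consumed.
text_without = '42 (0.500)\n21 (0.351)'
full = pattern.match(text_without).group(0)
print(full == text_without)

# "BLOCK" ahead: matching stops just before it.
text_with = '13 (0.716)\nBLOCK 2.'
partial = pattern.match(text_with).group(0)
print(repr(partial))
```

This is why the block-extraction regex can use the pattern to swallow everything belonging to one block and nothing from the next.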
You can also use a dict comprehension instead of a list comprehension:
>>> {i[-2]:re.findall(r'(\d+)\s+\(([\d.]+)\)',i[0]) for i in l}
{'BLOCK 1.': [('42', '0.500'), ('21', '0.351'), ('22', '0.149')],
'BLOCK 2.': [('1123', '0.392'), ('2341', '0.324'), ('2121', '0.176'), ('1121', '0.108')],
'BLOCK 3.': [('13', '0.716'), ('34', '0.284')]}
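If numeric frequencies are needed rather than strings, the pairs can be converted afterwards. A minimal sketch, assuming a dict shaped like the one above (the names `parsed` and `numeric` and the shortened data are illustrative). Haplotypes are kept as strings here on the assumption that they are labels rather than quantities; adjust if that does not fit your use.

```python
# Hypothetical input in the same shape as the dict comprehension's result.
parsed = {
    'BLOCK 1.': [('42', '0.500'), ('21', '0.351'), ('22', '0.149')],
    'BLOCK 3.': [('13', '0.716'), ('34', '0.284')],
}

# Convert only the frequency; keep the haplotype label as a string.
numeric = {
    title: [(hap, float(freq)) for hap, freq in pairs]
    for title, pairs in parsed.items()
}

for title in sorted(numeric):
    print(title, numeric[title])
```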