I'm new to Python. I have a large input file in a header-based format, where each header line starts with '>'. My file looks like this:
>NC_23689
#
# XYZ
# Copyright (c) BLASC
#
# Predicted binding regions
# No. Start End Length
# 1 1 25 25
# 2 39 47 9
#
>68469409
#
# XYZ
# Copyright (c) BLASC
#
# Predicted binding regions
# None.
#
# Prediction profile output:
# Columns:
# 1 - Amino acid number
# 2 - One letter code
# 3 - probability value
# 4 - output
#
1 M 0.1325 0
2 S 0.1341 0
3 S 0.1384 0
>68464675
#
# XYZ
# Copyright (c) BLASC
#
# Predicted binding regions
# No. Start End Length
# 1 13 24 12
# 2 31 53 23
# 3 81 95 15
# 4 115 164 50
#
...
...
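I want to extract each header and its corresponding Start-End values (the rows that follow the "Predicted binding regions" line) into an output file (output.txt). For the input above (input.txt), the output would be: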
NC_23689: 1-25, 39-47
68464675: 13-24, 31-53, 81-95, 115-164
I tried:
with open('input.txt') as infile, open('output.txt', 'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == ">+":
            copy = True
        elif line.strip() == "# No. Start End Length":
            copy = True
        elif line.strip() == "#":
            copy = False
        elif copy:
            outfile.write(line)
But it gives me:
# 1 1 25 25
# 2 39 47 9
# 1 13 24 12
# 2 31 53 23
# 3 81 95 15
# 4 115 164 50
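This is clearly not right. I get the ranges, but without the header descriptors and with some extra values. How can I get the output shown above? Thanks.
P.S. I'm using Python 2.7 on a Windows 7 machine.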
Answer 0 (score: 0)
Try this:
with open("file.txt") as f:
    header = None
    start_ends = []
    for line in f:
        fields = line.split()
        if line.startswith(">"):
            # A new header starts: print the previous record if it had regions.
            if start_ends:
                print("{}: {}".format(header, ", ".join(start_ends)))
            header = line.strip().lstrip(">")
            start_ends = []
        # Region rows look like "# 1 1 25 25": a marker followed by four integers.
        # isdigit() is used because Python 2.7 byte strings have no isnumeric().
        elif len(fields) == 5 and "".join(fields[1:]).isdigit():
            start_ends.append("{}-{}".format(fields[2], fields[3]))
    # Print the last record, which has no following header to trigger it.
    if start_ends:
        print("{}: {}".format(header, ", ".join(start_ends)))
# Outputs:
# NC_23689: 1-25, 39-47
# 68464675: 13-24, 31-53, 81-95, 115-164
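Since the question asks for the result in output.txt rather than on the console, here is a rough sketch of the same idea that writes to a file. It swaps the field-counting check for a regular expression, and it assumes region rows always have the shape "# No Start End Length" with integer values, as in the sample. The names REGION_ROW, extract_regions, format_record and convert are made up for this illustration; they are not from the question or the answer above. It should run on both Python 2.7 and Python 3.

```python
import re

# Matches region rows such as "# 1 13 24 12": '#', then No., Start, End, Length.
# Assumption: all four values are plain integers, as in the sample input.
REGION_ROW = re.compile(r"^#\s+\d+\s+(\d+)\s+(\d+)\s+\d+\s*$")


def extract_regions(lines):
    """Return [(header, [(start, end), ...]), ...] for records that have regions."""
    records = []
    header, regions = None, []
    for line in lines:
        if line.startswith(">"):
            # A new header closes the previous record; keep it only if non-empty.
            if header is not None and regions:
                records.append((header, regions))
            header, regions = line[1:].strip(), []
        else:
            match = REGION_ROW.match(line)
            if match and header is not None:
                regions.append((int(match.group(1)), int(match.group(2))))
    # The last record has no following header, so flush it here.
    if header is not None and regions:
        records.append((header, regions))
    return records


def format_record(header, regions):
    """Format one record as 'header: start-end, start-end, ...'."""
    ranges = ", ".join("{}-{}".format(start, end) for start, end in regions)
    return "{}: {}".format(header, ranges)


def convert(in_path, out_path):
    """Read the header-based file at in_path and write one line per record."""
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for header, regions in extract_regions(infile):
            outfile.write(format_record(header, regions) + "\n")
```

Calling convert("input.txt", "output.txt") would then produce the file the question describes. Records whose "Predicted binding regions" section says "None." are skipped, because no row in them matches the pattern.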