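Extracting information from a large file with a specific header format

Date: 2016-08-30 10:01:37

Tags: python

I am new to Python. I have a large input file in a header-based format, where each header line starts with '>'. My file looks like this: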

>NC_23689
#
# XYZ
# Copyright (c)  BLASC
#
# Predicted binding regions
#   No.                Start         End      Length
#   1                      1          25          25
#   2                     39          47           9
#
>68469409
#
# XYZ
# Copyright (c)  BLASC
#
# Predicted binding regions
#   None.
#
# Prediction profile output:
#   Columns:
#   1 - Amino acid number
#   2 - One letter code
#   3 -  probability value
#   4 - output
#
1   M     0.1325        0
2   S     0.1341        0
3   S     0.1384        0
>68464675
#
# XYZ
# Copyright (c)  BLASC
#
# Predicted binding regions
#   No.                Start         End      Length
#   1                     13          24          12
#   2                     31          53          23
#   3                     81          95          15
#   4                    115         164          50
#
...
...

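I want to extract each header together with its corresponding Start-End values (the rows after the "Predicted binding regions" line) into an output.txt file. For the input.txt above, the output would be: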

NC_23689: 1-25, 39-47
68464675: 13-24, 31-53, 81-95, 115-164

I tried:

with open('input.txt') as infile, open('output.txt', 'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == ">+":
            copy = True
        elif line.strip() == "#   No.                Start         End      Length":
            copy = True
        elif line.strip() == "#":
            copy = False
        elif copy:
            outfile.write(line)

But it gives me:

#   1                      1          25          25
#   2                     39          47           9
#   1                     13          24          12
#   2                     31          53          23
#   3                     81          95          15
#   4                    115         164          50

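This is clearly wrong. I get the ranges, but without the header descriptors and with some extra values. How can I get the output mentioned above? Thanks.

P.S. I am using Python 2.7 on a Windows 7 machine.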

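1 answer:

Answer 0: (score: 0)

Try this, which collects the Start-End pairs under each header and prints them when the next header is reached: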

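with open("input.txt") as f:
    header = None
    start_ends = []
    for line in f:
        line = line.rstrip()
        if line.startswith(">"):
            # New header: print the previous record if it had any regions.
            if start_ends:
                print("{}: {}".format(header, ", ".join(start_ends)))
            header = line.lstrip(">")
            start_ends = []
        else:
            fields = line.split()
            # Region rows are "#" followed by four integers (No., Start, End, Length).
            # isdigit() is used because Python 2.7 byte strings have no isnumeric().
            if len(fields) == 5 and "".join(fields[1:]).isdigit():
                start_ends.append("{}-{}".format(fields[2], fields[3]))
    # Print the last record, which has no following header to trigger it.
    if start_ends:
        print("{}: {}".format(header, ", ".join(start_ends)))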

# Outputs:
# NC_23689: 1-25, 39-47
# 68464675: 13-24, 31-53, 81-95, 115-164
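
The question asks for the result in output.txt rather than on the console. A minimal sketch of one way to do that is shown below; it is a variant, not the code from the answer above. The names parse_regions, write_regions and REGION_ROW are made up for illustration, and the regular expression assumes that region rows always have the form "#", row number, Start, End, Length, as in the sample input.

```python
import re

# Assumed row shape: "#   2    39    47    9"; captures the Start and End columns.
REGION_ROW = re.compile(r"^#\s+\d+\s+(\d+)\s+(\d+)\s+\d+\s*$")


def parse_regions(lines):
    """Yield (header, ["start-end", ...]) for each record that has regions."""
    header, regions = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header closes the previous record.
            if regions:
                yield header, regions
            header, regions = line[1:], []
        else:
            match = REGION_ROW.match(line)
            if match and header is not None:
                regions.append("{}-{}".format(match.group(1), match.group(2)))
    # The last record has no following header, so emit it here.
    if regions:
        yield header, regions


def write_regions(in_path, out_path):
    """Write one "header: start-end, ..." line per record to out_path."""
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for header, regions in parse_regions(infile):
            outfile.write("{}: {}\n".format(header, ", ".join(regions)))
```

Calling write_regions("input.txt", "output.txt") should then produce the file requested in the question. Because parse_regions reads the file line by line and only keeps one record in memory, it should also cope with large inputs.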