使用python解析结果

时间:2011-12-12 11:45:25

标签: python parsing

我是Python的初学者(我是一名生物学家),我有一个文件,其中包含特定软件的结果,我想使用python解析结果。从以下输出中我想获得得分,并希望将序列分成单个氨基酸。

没有。得分序列

1   0.273778    FFHH-YYFLHRRRKKCCNNN-CCCK---HQQ---HHKKHV-FGGGE-EDDEDEEEEEEEE-EE--
2   0.394647    IIVVIVVVVIVVVVVVVVVV-CCCVA-IVVI--LIIIIIIIIYYYA-AVVVVVVVAAAAV-AST-
3   0.456667        FIVVIVVVVIXXXXIGGGGT-CCCCAV -------------IVBBB-AAAAAA--------AAAA-  
4   0.407581    MMLMILLLLMVVAIILLIII-LLLIVLLAVVVVVAAAVAAVAIIII-ILIIIIIILVIMKKMLA-
5   0.331761    AANSRQSNAAQRRQCSNNNR-RALERGGMFFRRKQNNQKQKKHHHY-FYFYYSNNWWFFFFFFR-
6   0.452381    EEEEDEEEEEEEEEEEEEEE-EEEEESSTSTTTAEEEEEEEEEEEE-EEEEEEEEEEEEEEEEE-
7   0.460385    LLLLLLLLMMIIILLLIIII-IIILLVILMMEEFLLLLILIVLLLM-LLLLLLLLLLVILLLVL-
8   0.438680    ILILLVVVVILVVVLQLLMM-QKQLIVVLLVIIMLLLLMLLSIIIS-SMMMILFFLLILIIVVL-
9   0.393291    QQQDEEEQAAEEEDEKGSSD-QQEQDDQDEEAAAHQLESSATVVQR-QQQQQVVYTHSTVTTTE-

从上表中,我想获得一个具有相同数字,得分的表,但序列分开(列式)  所以看起来应该是

no.      score         amino acid(1st column)

1      0.273778         F

2      0.395657         I

3      0.456667         F

另一个代表第二列氨基酸的表

no       score       amino acid (2nd column)

1       0.273778         F

2       0.395657         I

3       0.456667         I  
表示第三列氨基酸的第三表和第四列氨基酸的第四表等等

提前感谢您的帮助

4 个答案:

答案 0 :(得分:5)

假设您已将包含此数据的文件打开为f,那么您的示例可以通过以下方式重现:

for ln in f:    # loop over all lines
    seqno, score, seq = ln.split()
    print("%s    %s    %s" % (seqno, score, seq[0]))

要拆分序列,您需要另外循环seq中的字母:

for ln in f:
    seqno, score, seq = ln.split()
    for x in seq:
        print("%s    %s    %s" % (seqno, score, seq[0]))

这将打印序列号并多次分数。我不确定这是不是你想要的。

答案 1 :(得分:0)

从你的例子中我猜:

  • 您希望将每个表保存到不同的结果文件。
  • 每个序列长度为65个字符
  • 某些序列包含无意义的空格,必须将其删除(在您的示例中为第3行)

以下是我的代码示例,它从input.dat读取数据并将结果写入result-column-<number>.dat

import re
import sys

# I will write each table to different results-file.
# dictionary to map columns (numbers) to opened file objects:
resultfiles = {}


def get_result_file(column):
    # helper to easily access results file.
    if column not in resultfiles:
        resultfiles[column] = open('result-column-%d.dat' % column, 'w')
    return resultfiles[column]


# iterate over data:
for line in open('input.dat'):
    try:
        # str.split(separator, maxsplit)
        # with `maxsplit`=2 it is more fail-proof:
        no, score, seq = line.split(None, 2)

        # from your example I guess that white-spaces in sequence are meaningless,
        # however in your example one sequence contains white-space, so I remove it:
        seq = re.sub('\s+', '', seq)

        # data validation will help to spot problems early:
        assert int(no), no          
        assert float(score), score
        assert len(seq) == 65, seq

    except Exception, e:
        # print the error and continue to process data:
        print >> sys.stderr, 'Error %s in line: %s.' % (e, line)
        continue  # jump to next iteration of for loop.

    # int(), float() will rise ValueError if no or score aren't numbers
    # assert <condition> will rise AssertionError if condition is False.

    # iterate over each character in amino sequance:
    for column, char in enumerate(seq, 1):
        f = get_result_file(column)
        f.write('%s    %s    %s\n' % (no, score, char))


# close all opened result files:
for f in resultfiles.values():
    f.close()

此示例中使用的值得注意的函数:

答案 2 :(得分:0)

我认为创建表格并不重要 只需将数据放在一个适应的结构中,并使用一个功能,在您需要的时刻显示您需要的内容:

with open('bio.txt') as f:
    data = [line.rstrip().split(None,2) for line in f if line.strip()]


def display(data,nth,pat='%-6s  %-15s  %s',uz=('th','st','nd','rd')):
    print pat % ('no.','score',
                 'amino acid(%d%s column)' %(nth,uz[0 if nth//4 else nth]))
    print '\n'.join(pat % (a,b,c[nth-1]) for a,b,c in data)    

display(data,1)
print
display(data,3)
print
display(data,7)

结果

no.     score            amino acid(1st column)
1       0.273778         F
2       0.394647         I
3       0.456667         F
4       0.407581         M
5       0.331761         A
6       0.452381         E
7       0.460385         L
8       0.438680         I
9       0.393291         Q

no.     score            amino acid(3rd column)
1       0.273778         H
2       0.394647         V
3       0.456667         V
4       0.407581         L
5       0.331761         N
6       0.452381         E
7       0.460385         L
8       0.438680         I
9       0.393291         Q

no.     score            amino acid(7th column)
1       0.273778         Y
2       0.394647         V
3       0.456667         V
4       0.407581         L
5       0.331761         S
6       0.452381         E
7       0.460385         L
8       0.438680         V
9       0.393291         E

答案 3 :(得分:0)

这是一个简单的工作解决方案:

#opening file: "db.txt" full path to file if it is in the same directory as python file 
#you can use any extension for the file ,'r' for reading mode
filehandler=open("db.txt",'r') 
#Saving all the lines once in a list every line is a list member
#Another way: you can read it line by line
LinesList=filehandler.readlines()
#creating an empty multi dimension list to store your results
no=[]
Score=[]
AminoAcids=[] # this is a multi-dimensional list for example index 0 has a list of char. of first line and so on
#process each line assuming constant spacing in the input file
#no is the first char. score from char 4 to 12 and Amino from 16 to end
for Line in LinesList:
    #add the no
    no.append(Line[0])
    #add the score
    Score.append(Line[4:12])
    Aminolist=list(Line[16:]) #breaking the amino acid as each character is a list element
    #add Aminolist to the AminoAcids Matrix (multi-dimensional array)
    AminoAcids.append(Aminolist)

#you can now play with the data!
#printing Tables ,you can also write them into a file instead
for k in range(0,65):
    print"Table %d" %(k+1) # adding 1 to not be zero indexed
    print"no. Score      amino acid(column %d)" %(k+1)
    for i in range(len(no)):
        print "%s   %s   %s" %(no[i],Score[i],AminoAcids[i][k])

以下是控制台上显示的部分结果:

Table 1
no. Score      amino acid(column 1)
1   0.273778   F
2   0.394647   I
3   0.456667   F
4   0.407581   M
5   0.331761   A
6   0.452381   E
7   0.460385   L
8   0.438680   I
9   0.393291   Q
Table 2
no. Score      amino acid(column 2)
1   0.273778   F
2   0.394647   I
3   0.456667   I
4   0.407581   M
5   0.331761   A
6   0.452381   E
7   0.460385   L
8   0.438680   L
9   0.393291   Q
Table 3
no. Score      amino acid(column 3)
1   0.273778   H
2   0.394647   V
3   0.456667   V
4   0.407581   L
5   0.331761   N
6   0.452381   E
7   0.460385   L
8   0.438680   I
9   0.393291   Q
Table 4
no. Score      amino acid(column 4)
1   0.273778   H
2   0.394647   V
3   0.456667   V
4   0.407581   M
5   0.331761   S
6   0.452381   E
7   0.460385   L
8   0.438680   L
9   0.393291   D
Table 5
no. Score      amino acid(column 5)
1   0.273778   -
2   0.394647   I
3   0.456667   I
4   0.407581   I
5   0.331761   R
6   0.452381   D
7   0.460385   L
8   0.438680   L
9   0.393291   E
Table 6
no. Score      amino acid(column 6)
1   0.273778   Y
2   0.394647   V
3   0.456667   V
4   0.407581   L
5   0.331761   Q
6   0.452381   E
7   0.460385   L
8   0.438680   V
9   0.393291   E
Table 7
no. Score      amino acid(column 7)
1   0.273778   Y
2   0.394647   V
3   0.456667   V
4   0.407581   L
5   0.331761   S
6   0.452381   E
7   0.460385   L
8   0.438680   V
9   0.393291   E