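I am a Python beginner (I am a biologist). I have a file containing the results from a particular piece of software, and I would like to parse those results with Python. From the output below I want to extract the score and split the sequence into single amino acids.
no. score sequence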
1 0.273778 FFHH-YYFLHRRRKKCCNNN-CCCK---HQQ---HHKKHV-FGGGE-EDDEDEEEEEEEE-EE--
2 0.394647 IIVVIVVVVIVVVVVVVVVV-CCCVA-IVVI--LIIIIIIIIYYYA-AVVVVVVVAAAAV-AST-
3 0.456667 FIVVIVVVVIXXXXIGGGGT-CCCCAV -------------IVBBB-AAAAAA--------AAAA-
4 0.407581 MMLMILLLLMVVAIILLIII-LLLIVLLAVVVVVAAAVAAVAIIII-ILIIIIIILVIMKKMLA-
5 0.331761 AANSRQSNAAQRRQCSNNNR-RALERGGMFFRRKQNNQKQKKHHHY-FYFYYSNNWWFFFFFFR-
6 0.452381 EEEEDEEEEEEEEEEEEEEE-EEEEESSTSTTTAEEEEEEEEEEEE-EEEEEEEEEEEEEEEEE-
7 0.460385 LLLLLLLLMMIIILLLIIII-IIILLVILMMEEFLLLLILIVLLLM-LLLLLLLLLLVILLLVL-
8 0.438680 ILILLVVVVILVVVLQLLMM-QKQLIVVLLVIIMLLLLMLLSIIIS-SMMMILFFLLILIIVVL-
9 0.393291 QQQDEEEQAAEEEDEKGSSD-QQEQDDQDEEAAAHQLESSATVVQR-QQQQQVVYTHSTVTTTE-
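From the table above I would like to get a table with the same number and score, but with the sequence split up column by column, so it should look like this: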
no. score amino acid(1st column)
1 0.273778 F
2 0.395657 I
3 0.456667 F
and another table for the amino acids in the second column:
no score amino acid (2nd column)
1 0.273778 F
2 0.395657 I
3 0.456667 I
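A third table for the amino acids in the third column, a fourth table for the amino acids in the fourth column, and so on.
Thanks in advance for your help.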
Answer 0 (score: 5)
Assuming you have opened the file containing this data as f, your example can be reproduced with:
for ln in f:  # loop over all lines
    seqno, score, seq = ln.split()
    print("%s %s %s" % (seqno, score, seq[0]))
To split the sequence, you need an additional loop over the letters in seq:
for ln in f:
    seqno, score, seq = ln.split()
    for x in seq:  # loop over every letter in the sequence
        print("%s %s %s" % (seqno, score, x))
This prints the sequence number and the score multiple times, once per letter. I'm not sure whether that is what you want.
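The loops above produce output grouped by input line. To get one table per sequence position instead, the rows can be read first and the positions looped over afterwards. This is only a minimal sketch, assuming all rows fit in memory; `parse_rows` and `column_table` are hypothetical helper names, and the two sample lines are copied from the question.

```python
def parse_rows(lines):
    """Split each non-empty line into (number, score, sequence)."""
    rows = []
    for ln in lines:
        if not ln.strip():
            continue  # skip blank lines
        seqno, score, seq = ln.split(None, 2)
        # drop any whitespace inside the sequence (and the trailing newline)
        rows.append((seqno, score, "".join(seq.split())))
    return rows


def column_table(rows, column):
    """Return the table lines for one 1-based sequence position."""
    return ["%s %s %s" % (seqno, score, seq[column - 1])
            for seqno, score, seq in rows]


# sample data (assumption: two lines taken from the question)
sample = [
    "1 0.273778 FFHH-YYFLHRRRKKCCNNN-CCCK---HQQ---HHKKHV-FGGGE-EDDEDEEEEEEEE-EE--",
    "2 0.394647 IIVVIVVVVIVVVVVVVVVV-CCCVA-IVVI--LIIIIIIIIYYYA-AVVVVVVVAAAAV-AST-",
]
rows = parse_rows(sample)
for column in (1, 2, 3):
    print("no. score amino acid (column %d)" % column)
    print("\n".join(column_table(rows, column)))
```

With a real file, `parse_rows(f)` would take the open file object in place of the sample list.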
Answer 1 (score: 0)
Here is my code example, based on what I guess from your example. It reads the data from input.dat and writes the results to result-column-<number>.dat:
import re
import sys

# Each table is written to its own results file.
# Dictionary mapping column numbers to opened file objects:
resultfiles = {}

def get_result_file(column):
    # helper to easily access the results file for a column
    if column not in resultfiles:
        resultfiles[column] = open('result-column-%d.dat' % column, 'w')
    return resultfiles[column]

# iterate over data:
with open('input.dat') as infile:
    for line in infile:
        try:
            # str.split(separator, maxsplit)
            # with maxsplit=2 it is more fail-proof:
            no, score, seq = line.split(None, 2)
            # from your example I guess that white-space in a sequence is meaningless,
            # however one sequence in your example contains white-space, so I remove it:
            seq = re.sub(r'\s+', '', seq)
            # data validation will help to spot problems early:
            int(no)
            float(score)
            assert len(seq) == 65, seq
        except Exception as e:
            # int() and float() raise ValueError if no or score aren't numbers;
            # assert <condition> raises AssertionError if the condition is False.
            # print the error and continue to process data:
            print('Error %s in line: %s.' % (e, line), file=sys.stderr)
            continue  # jump to the next iteration of the for loop
        # iterate over each character in the amino acid sequence:
        for column, char in enumerate(seq, 1):
            f = get_result_file(column)
            f.write('%s %s %s\n' % (no, score, char))

# close all opened result files:
for f in resultfiles.values():
    f.close()
Notable functions used in this example: str.split, re.sub and enumerate.
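As a small illustration of how str.split with maxsplit, re.sub and enumerate behave, the snippet below applies them to a single line. It is only a sketch: the line is row 3 from the question, which contains a stray space inside the sequence, and `first_three` is a name introduced here for illustration.

```python
import re

line = "3 0.456667 FIVVIVVVVIXXXXIGGGGT-CCCCAV -------------IVBBB-AAAAAA--------AAAA-\n"

# maxsplit=2 keeps everything after the score in one piece, spaces included
no, score, seq = line.split(None, 2)

# remove every run of whitespace, including the trailing newline
seq = re.sub(r"\s+", "", seq)

# enumerate(seq, 1) numbers the characters starting from 1 instead of 0
first_three = list(enumerate(seq, 1))[:3]
print(no, score, first_three)
```

Without maxsplit, the stray space would produce four fields and the three-name unpacking would raise ValueError.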
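Answer 2 (score: 0)
I don't think creating the tables matters. Just put the data in a suitable structure, and use a function that displays what you need at the moment you need it: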
with open('bio.txt') as f:
    data = [line.rstrip().split(None, 2) for line in f if line.strip()]

def display(data, nth, pat='%-6s %-15s %s', uz=('th', 'st', 'nd', 'rd')):
    # header row with an ordinal suffix for the column number
    print(pat % ('no.', 'score',
                 'amino acid(%d%s column)' % (nth, uz[0 if nth // 4 else nth])))
    # one row per input line, showing the nth character of the sequence
    print('\n'.join(pat % (a, b, c[nth - 1]) for a, b, c in data))

display(data, 1)
print()
display(data, 3)
print()
display(data, 7)
Result
no. score amino acid(1st column)
1 0.273778 F
2 0.394647 I
3 0.456667 F
4 0.407581 M
5 0.331761 A
6 0.452381 E
7 0.460385 L
8 0.438680 I
9 0.393291 Q
no. score amino acid(3rd column)
1 0.273778 H
2 0.394647 V
3 0.456667 V
4 0.407581 L
5 0.331761 N
6 0.452381 E
7 0.460385 L
8 0.438680 I
9 0.393291 Q
no. score amino acid(7th column)
1 0.273778 Y
2 0.394647 V
3 0.456667 V
4 0.407581 L
5 0.331761 S
6 0.452381 E
7 0.460385 L
8 0.438680 V
9 0.393291 E
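The suffix lookup in display returns 'th' for every nth of 4 or more, so column 21 would be labelled "21th". Since the sequences here have 65 positions, a small helper could cover the higher numbers. This is a sketch only; `ordinal` is a hypothetical name and not part of the answer's code.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th and 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "%d%s" % (n, suffix)


for n in (1, 2, 3, 7, 11, 21, 65):
    print(ordinal(n))
```

The header could then be built as 'amino acid(%s column)' % ordinal(nth).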
Answer 3 (score: 0)
Here is a simple working solution:
# Open the file "db.txt"; give the full path unless it is in the same directory as the Python file.
# Any file extension works; 'r' means reading mode.
filehandler = open("db.txt", 'r')
# Read all the lines at once into a list; every line is a list member.
# (Another way: read the file line by line.)
LinesList = filehandler.readlines()
filehandler.close()
# Create empty lists to store the results
no = []
Score = []
AminoAcids = []  # a list of lists: index 0 holds the characters of the first line's sequence, and so on
# Process each line, assuming constant spacing in the input file:
# no is the first character, the score is characters 4 to 11, and the amino acids run from character 16 to the end
for Line in LinesList:
    # add the no
    no.append(Line[0])
    # add the score
    Score.append(Line[4:12])
    # break the amino acid sequence up so that each character is a list element
    Aminolist = list(Line[16:])
    # add Aminolist to the AminoAcids matrix (list of lists)
    AminoAcids.append(Aminolist)
# You can now play with the data!
# Print the tables; you could also write them to a file instead
for k in range(0, 65):
    print("Table %d" % (k + 1))  # add 1 so the numbering is not zero-indexed
    print("no. Score amino acid(column %d)" % (k + 1))
    for i in range(len(no)):
        print("%s %s %s" % (no[i], Score[i], AminoAcids[i][k]))
Here is part of the result shown on the console:
Table 1
no. Score amino acid(column 1)
1 0.273778 F
2 0.394647 I
3 0.456667 F
4 0.407581 M
5 0.331761 A
6 0.452381 E
7 0.460385 L
8 0.438680 I
9 0.393291 Q
Table 2
no. Score amino acid(column 2)
1 0.273778 F
2 0.394647 I
3 0.456667 I
4 0.407581 M
5 0.331761 A
6 0.452381 E
7 0.460385 L
8 0.438680 L
9 0.393291 Q
Table 3
no. Score amino acid(column 3)
1 0.273778 H
2 0.394647 V
3 0.456667 V
4 0.407581 L
5 0.331761 N
6 0.452381 E
7 0.460385 L
8 0.438680 I
9 0.393291 Q
Table 4
no. Score amino acid(column 4)
1 0.273778 H
2 0.394647 V
3 0.456667 V
4 0.407581 M
5 0.331761 S
6 0.452381 E
7 0.460385 L
8 0.438680 L
9 0.393291 D
Table 5
no. Score amino acid(column 5)
1 0.273778 -
2 0.394647 I
3 0.456667 I
4 0.407581 I
5 0.331761 R
6 0.452381 D
7 0.460385 L
8 0.438680 L
9 0.393291 E
Table 6
no. Score amino acid(column 6)
1 0.273778 Y
2 0.394647 V
3 0.456667 V
4 0.407581 L
5 0.331761 Q
6 0.452381 E
7 0.460385 L
8 0.438680 V
9 0.393291 E
Table 7
no. Score amino acid(column 7)
1 0.273778 Y
2 0.394647 V
3 0.456667 V
4 0.407581 L
5 0.331761 S
6 0.452381 E
7 0.460385 L
8 0.438680 V
9 0.393291 E
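The nested index loops in this answer can also be written by transposing the sequences with zip, which avoids hard-coding the length 65. This is a sketch under the assumption that all sequences have the same length (zip stops at the shortest one); the three rows below are the question's first three rows cut down to their first five positions for brevity.

```python
numbers = ["1", "2", "3"]
scores = ["0.273778", "0.394647", "0.456667"]
sequences = ["FFHH-", "IIVVI", "FIVVI"]  # first five positions only

# zip(*sequences) yields one tuple per position, taking one character from each row
columns = list(zip(*sequences))

for k, column in enumerate(columns, 1):
    print("Table %d" % k)
    print("no. Score amino acid(column %d)" % k)
    for no, score, amino in zip(numbers, scores, column):
        print("%s %s %s" % (no, score, amino))
```

Each element of `columns` is one finished column, so writing a table to its own file only needs the inner loop.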