BioPython: amino acid sequence contains 'J', so the molecular weight cannot be calculated

Time: 2017-02-10 12:43:20

Tags: python bioinformatics xlrd biopython

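The data I am working with comes from an Excel file that has amino acid sequences in the column at index 1. I am trying to use BioPython to calculate various properties from each sequence. My current code: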

import xlrd
import sys
from Bio.SeqUtils.ProtParam import ProteinAnalysis

print '~~~~~~~~~~~~~~~ EXCEL PARSER FOR PVA/NON-PVA DATA ~~~~~~~~~~~~~~~'

print 'Path to Excel file:', str(sys.argv[1])
fname = sys.argv[1]
workbook = xlrd.open_workbook(fname, 'rU')

print ''
print 'The sheet names that have been found in the Excel file: '
sheet_names = workbook.sheet_names()
number_of_sheet = 1
for sheet_name in sheet_names:
    print '*', number_of_sheet, ':     ', sheet_name
    number_of_sheet += 1

with open("thefile.txt","w") as f:
    lines = []
    f.write('LENGTH.SEQUENCE,SEQUENCE,MOLECULAR.WEIGHT\n')
    for sheet_name in sheet_names:
        worksheet = workbook.sheet_by_name(sheet_name)
        print 'opened: ', sheet_name
        for i in range(1, worksheet.nrows):
            row = worksheet.row_values(i)
            analysed_seq = ProteinAnalysis(row[1].encode('utf-8'))
            weight = analysed_seq.molecular_weight()
            lines.append('{},{},{}\n'.format(row[2], row[1].encode('utf-8'), weight))
    f.writelines(lines)

This worked until I added the molecular weight calculation, which raises the following error:

Traceback (most recent call last):
  File "Excel_PVAdata_Parser.py", line 28, in <module>
    weight = analysed_seq.molecular_weight()
  File "/usr/lib/python2.7/dist-packages/Bio/SeqUtils/ProtParam.py", line 114, in molecular_weight
    total_weight += aa_weights[aa]
KeyError: 'J'

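I checked the Excel data file, and the amino acid sequence does indeed contain a J. Does anyone know of a BioPython package that handles "unknown" amino acids, or have another suggestion?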

2 answers:

Answer 0: (score: 3)

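Biopython uses the IUPAC protein molecular weights; see https://github.com/biopython/biopython/blob/master/Bio/Data/IUPACData.py

J is the ambiguous amino acid code for either leucine or isoleucine (L or I). It is used with NMR when these two amino acids cannot be distinguished.

Depending on why you need the molecular weight, it might be appropriate to use the average of the L and I weights.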
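Before choosing a substitution, it may help to see which letters in a sequence fall outside the 20 standard one-letter codes. The following is only a minimal sketch: the helper name `find_nonstandard_letters` and the example sequence are hypothetical, and it does not check which letters a particular Biopython version actually accepts.

```python
# The 20 standard one-letter amino acid codes.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def find_nonstandard_letters(sequence):
    """Map each non-standard letter to the 0-based positions where it occurs.

    Hypothetical helper for illustration; it is not part of Biopython.
    """
    found = {}
    for position, letter in enumerate(sequence.upper()):
        if letter not in STANDARD_AMINO_ACIDS:
            found.setdefault(letter, []).append(position)
    return found


# Made-up example sequence, not taken from the asker's Excel file.
print(find_nonstandard_letters("MKJLAVXJ"))
```

Running this over each row before calling `molecular_weight()` would show whether J is the only problematic letter or whether other ambiguity codes are present as well.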

Answer 1: (score: 2)

As peterjc said, J is the ambiguous amino acid code for either leucine (L) or isoleucine (I). Both have the same molecular weight:

>>> from Bio.SeqUtils.ProtParam import ProteinAnalysis
>>> ProteinAnalysis('L').molecular_weight()
131.1729
>>> ProteinAnalysis('I').molecular_weight()
131.1729

So, to calculate the molecular weight, you could temporarily replace every J with L or I.
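That substitution could be sketched as follows. The function names `replace_ambiguous_j` and `molecular_weight_ignoring_j` are hypothetical, and the code assumes the substitution is wanted only for the weight calculation, so the original sequence stays untouched.

```python
def replace_ambiguous_j(sequence, replacement="L"):
    """Return an upper-cased copy of the sequence with every J swapped for L or I.

    Hypothetical helper for illustration; it is not part of Biopython.
    """
    if replacement not in ("L", "I"):
        raise ValueError("replacement must be 'L' or 'I'")
    # Upper-case first so that a lower-case 'j' is replaced as well.
    return sequence.upper().replace("J", replacement)


def molecular_weight_ignoring_j(sequence):
    """Compute the molecular weight after substituting J with L."""
    # Imported inside the function so replace_ambiguous_j can be used
    # even where Biopython is not installed.
    from Bio.SeqUtils.ProtParam import ProteinAnalysis

    return ProteinAnalysis(replace_ambiguous_j(sequence)).molecular_weight()
```

In the loop from the question, `row[1]` would be passed through `molecular_weight_ignoring_j` to get the weight, while the unmodified sequence is still the one written to the output file.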