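I have a number of molecules in SMILES format, and I want to get each molecule's name from its SMILES string. I would like to do the conversion in Python.
For example: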
CN1CCC[C@H]1c2cccnc2 - Nicotine
OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2 - Thiamin
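Which Python module can help me do this conversion? Please let me know.
Answer 0 (score: 1)
I don't know of any single module that will let you do this; I had to play data wrangler to try to get a satisfactory answer.
I tackled this using Wikipedia, which is being used more and more for structured bioinformatics/cheminformatics data, but as it turned out my program shows that a lot of that data is incorrect.
I used urllib to submit a SPARQL query to DBpedia, first searching for the SMILES string and, failing that, searching for the molecular weight of the compound.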
# Note: this script targets Python 2 (urllib2, print statements).
import sys
import urllib
import urllib2
import traceback
import pybel
import json
def query(q, epr, f='application/json'):
    # Send SPARQL query q to endpoint epr via GET and return the raw response body.
    try:
        params = {'query': q}
        params = urllib.urlencode(params)
        opener = urllib2.build_opener(urllib2.HTTPHandler)
        request = urllib2.Request(epr + '?' + params)
        request.add_header('Accept', f)
        request.get_method = lambda: 'GET'
        url = opener.open(request)
        return url.read()
    except Exception as e:
        traceback.print_exc(file=sys.stdout)
        raise e
url = 'http://dbpedia.org/sparql'
q1 = '''
select ?name where {
?s <http://dbpedia.org/property/smiles> "%s"@en.
?s rdfs:label ?name.
FILTER(LANG(?name) = "" || LANGMATCHES(LANG(?name), "en"))
}
limit 10
'''
q2 = '''
select ?name where {
?s <http://dbpedia.org/property/molecularWeight> '%s'^^xsd:double.
?s rdfs:label ?name.
FILTER(LANG(?name) = "" || LANGMATCHES(LANG(?name), "en"))
}
limit 10
'''
smiles = filter(None, '''
CN1CCC[C@H]1c2cccnc2
CN(CCC1)[C@@H]1C2=CC=CN=C2
OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2
Cc1nnc2CN=C(c3ccccc3)c4cc(Cl)ccc4-n12
CN1C(=O)CN=C(c2ccccc2)c3cc(Cl)ccc13
CCc1nn(C)c2c(=O)[nH]c(nc12)c3cc(ccc3OCC)S(=O)(=O)N4CCN(C)CC4
CC(C)(N)Cc1ccccc1
CN(C)C(=O)Cc1c(nc2ccc(C)cn12)c3ccc(C)cc3
COc1ccc2[nH]c(nc2c1)S(=O)Cc3ncc(C)c(OC)c3C
CCN(CC)C(=O)[C@H]1CN(C)[C@@H]2Cc3c[nH]c4cccc(C2=C1)c34
'''.splitlines())
OBMolecules = {}
for smile in smiles:
    try:
        OBMolecules[smile] = pybel.readstring('smi', smile)
    except Exception as e:
        print e
for smi in smiles:
    print '--------------'
    print smi
    try:
        print "searching by smiles string.."
        results = json.loads(query(q1 % smi, url))
        if len(results['results']['bindings']) == 0:
            raise Exception('no results from smiles')
        else:
            print 'NAME: ', results['results']['bindings'][0]['name']['value']
    except Exception as e:
        print e
        # The SMILES lookup failed, so fall back to a molecular-weight search.
        try:
            mol_weight = round(OBMolecules[smi].molwt, 2)
            print "search ing by molecular weight %s" % mol_weight
            results = json.loads(query(q2 % mol_weight, url))
            if len(results['results']['bindings']) == 0:
                raise Exception('no results from molecular weight')
            else:
                print 'NAME: ', results['results']['bindings'][0]['name']['value']
        except Exception as e:
            print e
...output
--------------
CN1CCC[C@H]1c2cccnc2
searching by smiles string..
no results from smiles
search ing by molecular weight 162.23
NAME: Anabasine
--------------
CN(CCC1)[C@@H]1C2=CC=CN=C2
searching by smiles string..
no results from smiles
search ing by molecular weight 162.23
NAME: Anabasine
--------------
OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2
searching by smiles string..
no results from smiles
search ing by molecular weight 267.37
NAME: Pipradrol
--------------
Cc1nnc2CN=C(c3ccccc3)c4cc(Cl)ccc4-n12
searching by smiles string..
no results from smiles
search ing by molecular weight 308.76
no results from molecular weight
--------------
CN1C(=O)CN=C(c2ccccc2)c3cc(Cl)ccc13
searching by smiles string..
no results from smiles
search ing by molecular weight 284.74
NAME: Mazindol
--------------
CCc1nn(C)c2c(=O)[nH]c(nc12)c3cc(ccc3OCC)S(=O)(=O)N4CCN(C)CC4
searching by smiles string..
no results from smiles
search ing by molecular weight 460.55
no results from molecular weight
--------------
CC(C)(N)Cc1ccccc1
searching by smiles string..
no results from smiles
search ing by molecular weight 149.23
NAME: Phenpromethamine
--------------
CN(C)C(=O)Cc1c(nc2ccc(C)cn12)c3ccc(C)cc3
searching by smiles string..
no results from smiles
search ing by molecular weight 307.39
NAME: Talastine
--------------
COc1ccc2[nH]c(nc2c1)S(=O)Cc3ncc(C)c(OC)c3C
searching by smiles string..
no results from smiles
search ing by molecular weight 345.42
no results from molecular weight
--------------
CCN(CC)C(=O)[C@H]1CN(C)[C@@H]2Cc3c[nH]c4cccc(C2=C1)c34
searching by smiles string..
no results from smiles
search ing by molecular weight 323.43
NAME: Lysergic acid diethylamide
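As you can see, the first two results, which should be nicotine, come out wrong. This is because the Wikipedia entry for nicotine reports the molar mass in the molecular weight field.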
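For readers on Python 3, where urllib2 no longer exists, the request-building and response-parsing steps of the script above might look roughly like the sketch below. The helper names `build_sparql_url` and `first_label` are made up for illustration, the query text is the same SMILES template as in the script, and no network call is made here; fetching the URL (for example with `urllib.request.urlopen` and an `Accept: application/json` header) is left to the caller.

```python
import json
from urllib.parse import urlencode, parse_qs, urlparse

DBPEDIA_ENDPOINT = 'http://dbpedia.org/sparql'  # same endpoint as the script above

SMILES_QUERY = '''
select ?name where {
  ?s <http://dbpedia.org/property/smiles> "%s"@en.
  ?s rdfs:label ?name.
  FILTER(LANG(?name) = "" || LANGMATCHES(LANG(?name), "en"))
}
limit 10
'''


def build_sparql_url(query_text, endpoint=DBPEDIA_ENDPOINT):
    """Return the GET URL for a SPARQL query (hypothetical helper)."""
    return endpoint + '?' + urlencode({'query': query_text})


def first_label(response_text):
    """Return the first ?name binding of a SPARQL JSON response, or None."""
    bindings = json.loads(response_text)['results']['bindings']
    return bindings[0]['name']['value'] if bindings else None


url = build_sparql_url(SMILES_QUERY % 'CC(C)(N)Cc1ccccc1')
```

Returning `None` for an empty result set, rather than raising, lets the caller decide whether to fall back to the molecular-weight query.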
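Answer 1 (score: 1)
The Open Babel documentation has a section on similarity searching that you might want to look at; you could combine it with the SDF file from ChEMBL.
I'll have another go at this later, as it looks more fruitful than my previous answer!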
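Similarity searching usually means comparing molecular fingerprints with a score such as the Tanimoto coefficient. The sketch below shows only that scoring step in plain Python, with fingerprints modelled as sets of "on" bit positions. The bit sets and compound names are invented for illustration; real fingerprints would come from a toolkit such as Open Babel rather than being typed by hand.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def most_similar(query_bits, library):
    """Return (name, score) of the library entry most similar to the query."""
    return max(
        ((name, tanimoto(query_bits, bits)) for name, bits in library.items()),
        key=lambda pair: pair[1],
    )


# Made-up fingerprints, for illustration only.
library = {
    'compound A': {1, 4, 9, 16},
    'compound B': {2, 3, 5, 7},
}
best_name, best_score = most_similar({1, 4, 9, 25}, library)
```

A nearest-match search like this always returns some candidate, so in practice a minimum score threshold is needed before treating the best hit as the same compound.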
Answer 2 (score: 0)
Reference: NCI/CADD
from urllib.request import urlopen

def CIRconvert(smi):
    # Ask the NCI/CADD resolver for the IUPAC name of a SMILES string.
    try:
        url = "https://cactus.nci.nih.gov/chemical/structure/" + smi + "/iupac_name"
        ans = urlopen(url).read().decode('utf8')
        return ans
    except Exception:
        return 'Name Not Available'

smiles = 'CCCCC(C)CC'
print(smiles, CIRconvert(smiles))
Output: CCCCC(C)CC - 3-Methylheptane
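One thing to watch with this approach is that SMILES strings often contain characters with special meaning in URLs, such as `#` for triple bonds, `[`, `]`, `@`, and `/`. A hedged sketch of percent-encoding the SMILES before building the request URL follows. `cir_url` is a hypothetical helper, and whether the resolver accepts every encoded character (an encoded `/` in particular) is an assumption that should be checked against the service.

```python
from urllib.parse import quote

CIR_BASE = 'https://cactus.nci.nih.gov/chemical/structure/'


def cir_url(smi, representation='iupac_name'):
    """Build a resolver URL with the SMILES percent-encoded (hypothetical helper)."""
    # safe='' also encodes '/', which would otherwise read as a path separator.
    return CIR_BASE + quote(smi, safe='') + '/' + representation


print(cir_url('CC#N'))
```

Without the encoding, everything after the `#` in `CC#N` would be treated as a URL fragment and never reach the server.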