[Updated]
Question
I have two databases:
Database 1:
1 Name: D-Tagatose 1,6-bisphosphate
2 Name: 1-Phosphatidyl-D-myo-inositol;: 1-Phosphatidyl-1D-myo-inositol;: 1-Phosphatidyl- myo-inositol;: Phosphatidyl-1D-myo-inositol;: (3-Phosphatidyl)-1-D-inositol;: 1,2-Diacyl-sn-glycero-3-phosphoinositol;: Phosphatidylinositol
3 Name: Androstenedione;: Androst-4-ene-3,17-dione;: 4-Androstene-3,17-dione
4 Name: Spermine;: N,N'-Bis(3-aminopropyl)-1,4-butanediamine
5 Name: H+;: Hydron
Database 2:
> <NAME> Benzaldehyde, 4-[(trimethylsilyl)oxy]- > <SYNONYMS> Benzaldehyde, p-(trimethylsiloxy)-
> <NAME> Benzeneacetic acid, methyl ester > <SYNONYMS> q qer
> <NAME> Cyclopropaneoctanoic acid, 2-[[2-[(2-ethylcyclopropyl)methyl]cyclopropyl]methyl]-, methyl ester > <SYNONYMS> Methyl 8-[2-((2-[(2-ethylcyclopropyl)methyl]cyclopropyl)methyl)cyclopropyl]octanoate #
> <NAME> Mevalonic lactone, trimethylsilyl deriv. > <SYNONYMS> Mevalonic lactone, trimethylsilyl
> <NAME> Benzeneacetic acid, phenylmethyl ester > <SYNONYMS> Acetic acid, phenyl-, benzyl ester
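Expected output:
Match the names or synonyms in database 2 against the names in database 1. These are chemical compounds, so small variations in the compound names can occur. That is also why I use the linked online database for matching.
Test input:
See the Excel file in the link. Data
(Use the following database for matching)
Small R input:
Input 1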
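As a starting point, the two record layouts listed above can be split into plain lists of names before any matching is attempted. The following is a minimal sketch, assuming one record per line in exactly the layouts shown (`Name: a;: b` and `> <NAME> x > <SYNONYMS> y`); the function names `parse_db1_line` and `parse_db2_line` are hypothetical and not part of the original post.

```python
import re

def parse_db1_line(line):
    """Split 'Name: a;: b;: c' (optionally prefixed by a row number) into ['a', 'b', 'c']."""
    body = re.sub(r"^\s*\d*\s*Name:\s*", "", line)
    return [n.strip() for n in body.split(";:") if n.strip()]

def parse_db2_line(line):
    """Split '> <NAME> x > <SYNONYMS> y' into (name, synonym); synonym is None for 'NA'."""
    m = re.match(r"^\s*>\s*<NAME>\s*(.*?)\s*>\s*<SYNONYMS>\s*(.*)$", line)
    if m is None:
        return None  # line does not follow the assumed layout
    name = m.group(1)
    # Drop the trailing '#' marker that some synonyms carry
    synonym = m.group(2).rstrip("# ").strip()
    return name, (synonym if synonym and synonym != "NA" else None)

print(parse_db1_line("5 Name: H+;: Hydron"))
print(parse_db2_line("> <NAME> Mevalonic lactone, trimethylsilyl deriv. "
                     "> <SYNONYMS> Mevalonic lactone, trimethylsilyl"))
```

Splitting on `;:` rather than on `;` or `,` keeps commas inside a single name (for example `N,N'-Bis(...)`) intact.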
structure(c("> <NAME>", "> <NAME>", "> <NAME>", "> <NAME>",
"> <NAME>", "> <NAME>", "> <NAME>", "> <NAME>", "> <NAME>",
"> <NAME>", "> <NAME>", "> <NAME>", "> <NAME>", "> <NAME>",
"> <NAME>", " Benzaldehyde, 4-[(trimethylsilyl)oxy]-", " Benzeneacetic acid, methyl ester",
" Cyclopropaneoctanoic acid, 2-[[2-[(2-ethylcyclopropyl)methyl]cyclopropyl]methyl]-, methyl ester",
" Mevalonic lactone, trimethylsilyl deriv.", " Benzeneacetic acid, phenylmethyl ester",
" Butanoic acid, 3,3-dimethyl-, methyl ester", " Acetic acid, (4-(trifluoromethoxy)phenyl)methyl ester",
" Phosphoramidothioic acid, O,S-dimethyl ester", " Octanoic acid, phenylmethyl ester",
" Benzenepropanoic acid, methyl ester", " 2-Propenoic acid, 3-phenyl-, methyl ester",
" Propanoic acid, 2-methyl-, phenylmethyl ester", " Acetic acid, (2,3-dichlorophenyl)methyl ester",
" L-Methionine, methyl ester", " Butanoic acid, phenylmethyl ester",
"<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>",
"<SYNONYMS>", "> <SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>",
"<SYNONYMS>", "<SYNONYMS>", "> <SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>",
" Benzaldehyde, p-(trimethylsiloxy)-", " Acetic acid, phenyl-, methyl ester",
" Methyl 8-[2-((2-[(2-ethylcyclopropyl)methyl]cyclopropyl)methyl)cyclopropyl]octanoate #",
" Mevalonic lactone, trimethylsilyl", " Acetic acid, phenyl-, benzyl ester",
" Butyric acid, 3,3-dimethyl-, methyl ester", " NA", " Methamidophos",
" Octanoic acid, benzyl ester", " Hydrocinnamic acid, methyl ester",
" Cinnamic acid, methyl ester", " Isobutyric acid, benzyl ester",
" NA", " Methyl 2-amino-4-(methylsulfanyl)butanoate #", " Butyric acid, benzyl ester"
), .Dim = c(15L, 4L), .Dimnames = list(c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15"),
c("NAME", NA, "NA.1", "NA.2")))
Input 2
structure(c("Name: 1-Phosphatidyl-D-myo-inositol;: 1-Phosphatidyl-1D-myo-inositol;: 1-Phosphatidyl-myo-inositol;: Phosphatidyl-1D-myo-inositol;: (3-Phosphatidyl)-1-D-inositol;: 1,2-Diacyl-sn-glycero-3-phosphoinositol;: Phosphatidylinositol",
"Name: Androstenedione;: Androst-4-ene-3,17-dione;: 4-Androstene-3,17-dione",
"Name: Spermine;: N,N'-Bis(3-aminopropyl)-1,4-butanediamine",
"Name: H+;: Hydron", "Name: 3-Iodo-L-tyrosine", "Name: 3-Methoxytyramine",
"Name: 3-Methoxy-4-hydroxyphenylacetaldehyde;: (4-Hydroxy-3-methoxyphenyl)acetaldehyde;: Homovanillin",
"Name: L-Noradrenaline;: Noradrenaline;: Norepinephrine;: Arterenol;: 4-[(1R)-2-Amino-1-hydroxyethyl]-1,2-benzenediol",
"Name: 3,4-Dihydroxymandelaldehyde;: 3,4-Dihydroxyphenylglycolaldehyde",
"Name: L-Metanephrine", "Name: L-Adrenaline;: (R)-(-)-Adrenaline;: (R)-(-)-Epinephrine;: (R)-(-)-Epirenamine;: (R)-(-)-Adnephrine;: 4-[(1R)-1-Hydroxy-2-(methylamino)ethyl]-1,2-benzenediol",
"Name: 3-Methoxy-4-hydroxyphenylglycolaldehyde", "Name: L-Normetanephrine",
"Name: L-Dopachrome;: 2-L-Carboxy-2,3-dihydroindole-5,6-quinone",
"Name: 5,6-Dihydroxyindole;: DHI"), .Dim = c(15L, 1L))
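Answer 0 (score: 3)
I don't think there is a simple solution to this problem. Working with SMILES or InChI/InChIKey can be painful, and matching common or IUPAC names is not easy.
You can look at the PubChem PUG REST method for retrieving compounds by name: https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST_Tutorial.html#_Toc338920590
A simple solution in Python (which should be easy to port to R) could look like this:
I exported your data as plain text files:
input1.csv: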
Benzaldehyde, 4-[(trimethylsilyl)oxy]-; Benzaldehyde, p-(trimethylsiloxy)-
Benzeneacetic acid, methyl ester; Acetic acid, phenyl-, methyl ester
Cyclopropaneoctanoic acid, 2-[[2-[(2-ethylcyclopropyl)methyl]cyclopropyl]methyl]-, methyl ester; Methyl 8-[2-((2-[(2-ethylcyclopropyl)methyl]cyclopropyl)methyl)cyclopropyl]octanoate #
Mevalonic lactone, trimethylsilyl deriv.; Mevalonic lactone, trimethylsilyl
Benzeneacetic acid, phenylmethyl ester; Acetic acid, phenyl-, benzyl ester
Butanoic acid, 3,3-dimethyl-, methyl ester; Butyric acid, 3,3-dimethyl-, methyl ester
Acetic acid, (4-(trifluoromethoxy)phenyl)methyl ester; NA
Phosphoramidothioic acid, O,S-dimethyl ester; Methamidophos
Octanoic acid, phenylmethyl ester; Octanoic acid, benzyl ester
Benzenepropanoic acid, methyl ester; Hydrocinnamic acid, methyl ester
2-Propenoic acid, 3-phenyl-, methyl ester; Cinnamic acid, methyl ester
Propanoic acid, 2-methyl-, phenylmethyl ester; Isobutyric acid, benzyl ester
Acetic acid, (2,3-dichlorophenyl)methyl ester; NA
L-Methionine, methyl ester; Methyl 2-amino-4-(methylsulfanyl)butanoate #
Butanoic acid, phenylmethyl ester; Butyric acid, benzyl ester
input2.csv:
Name: 1-Phosphatidyl-D-myo-inositol;: 1-Phosphatidyl-1D-myo-inositol;: 1-Phosphatidyl-myo-inositol;: Phosphatidyl-1D-myo-inositol;: (3-Phosphatidyl)-1-D-inositol;: 1,2-Diacyl-sn-glycero-3-phosphoinositol;: Phosphatidylinositol
Name: Androstenedione;: Androst-4-ene-3,17-dione;: 4-Androstene-3,17-dione
Name: Spermine;: N,N'-Bis(3-aminopropyl)-1,4-butanediamine
Name: H+;: Hydron
Name: 3-Iodo-L-tyrosine
Name: 3-Methoxytyramine
Name: 3-Methoxy-4-hydroxyphenylacetaldehyde;: (4-Hydroxy-3-methoxyphenyl)acetaldehyde;: Homovanillin
Name: L-Noradrenaline;: Noradrenaline;: Norepinephrine;: Arterenol;: 4-[(1R)-2-Amino-1-hydroxyethyl]-1,2-benzenediol
Name: 3,4-Dihydroxymandelaldehyde;: 3,4-Dihydroxyphenylglycolaldehyde
Name: L-Metanephrine
Name: L-Adrenaline;: (R)-(-)-Adrenaline;: (R)-(-)-Epinephrine;: (R)-(-)-Epirenamine;: (R)-(-)-Adnephrine;: 4-[(1R)-1-Hydroxy-2-(methylamino)ethyl]-1,2-benzenediol
Name: 3-Methoxy-4-hydroxyphenylglycolaldehyde
Name: L-Normetanephrine
Name: L-Dopachrome;: 2-L-Carboxy-2,3-dihydroindole-5,6-quinone
Name: 5,6-Dihydroxyindole;: DHI
Python code:
import requests
from urllib.parse import quote

def name_to_cids(name):
    '''Retrieve the set of PubChem CIDs for a given name'''
    # Percent-encode the name: compound names contain spaces, commas and brackets
    url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{0}/cids/JSON'.format(quote(name, safe=''))
    r = requests.get(url)
    return set(r.json()["IdentifierList"]["CID"]) if r.status_code == 200 else set()

def names_to_cids(names):
    '''Take a list of names and return the set of PubChem CIDs'''
    cids = set()
    for name in names:
        cids = cids.union(name_to_cids(name))
    return cids

def find_matching(from_key, from_dict, to_dict):
    '''Return the keys of to_dict that share at least one CID with from_dict[from_key]'''
    matching = []
    from_cids = from_dict[from_key]['cids']
    for k, v in to_dict.items():
        if len(from_cids.intersection(v['cids'])) != 0:
            matching.append(k)
    return matching

# Read input files: strip whitespace around each name and drop 'NA' placeholders
with open('input1.csv') as f:
    input_1 = [[n.strip() for n in line.replace('#', '').split(';') if n.strip() != 'NA'] for line in f if line.strip()]
with open('input2.csv') as f:
    input_2 = [[n.strip() for n in line.replace('Name: ', '').split(';:') if n.strip() != 'NA'] for line in f if line.strip()]
input_1_dict = {" ".join(names): {'names': names, 'cids': names_to_cids(names)} for names in input_1}
input_2_dict = {" ".join(names): {'names': names, 'cids': names_to_cids(names)} for names in input_2}
# dict.keys() is not indexable in Python 3, so take the first key via an iterator
print(find_matching(from_key=next(iter(input_1_dict)), from_dict=input_1_dict, to_dict=input_2_dict))
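The final `print` call only checks the first entry of input 1. To compare every entry, one option is to build an inverted index from CID to entry key once and then look each entry up in it. The sketch below assumes dictionaries shaped like `input_1_dict` and `input_2_dict` (each value holding a `'cids'` set); `match_all` is a hypothetical helper, and the CIDs in the toy data are made up so the example runs without network access.

```python
from collections import defaultdict

def match_all(from_dict, to_dict):
    """Map every key of from_dict to the keys of to_dict sharing at least one CID."""
    # Inverted index: CID -> keys of to_dict whose CID set contains it
    index = defaultdict(set)
    for key, entry in to_dict.items():
        for cid in entry['cids']:
            index[cid].add(key)
    matches = {}
    for key, entry in from_dict.items():
        hits = set()
        for cid in entry['cids']:
            hits |= index.get(cid, set())
        matches[key] = sorted(hits)
    return matches

# Toy data with invented CIDs, only to illustrate the shape of the result
db_a = {"compound a": {"cids": {1, 2}}, "compound b": {"cids": set()}}
db_b = {"compound x": {"cids": {2, 3}}, "compound y": {"cids": {4}}}
print(match_all(db_a, db_b))
```

Entries whose lookup returned no CIDs end up with an empty list, which makes it easy to see which names still need manual attention.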
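PubChem does not handle modified names well (for example, you can find "Benzeneacetic acid" but not "Benzeneacetic acid, methyl ester"), so depending on your needs you could consider dropping parts of the query string (i.e. search for "Benzeneacetic acid" instead of "Benzeneacetic acid, methyl ester"), but I realize this is far from ideal.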
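Dropping parts of the query string can be automated by trying progressively shorter versions of a name. The following is a minimal sketch, assuming that modifiers are separated by a comma followed by a space (so locants such as `3,3-dimethyl` stay intact); `fallback_queries` and `first_hit` are hypothetical helpers, and `lookup` stands for any function that returns a set of identifiers, such as `name_to_cids`.

```python
def fallback_queries(name):
    """Return the full name followed by shorter variants with trailing ', modifier' parts removed."""
    parts = name.strip().split(", ")
    return [", ".join(parts[:i]) for i in range(len(parts), 0, -1)]

def first_hit(name, lookup):
    """Return (query, ids) for the first variant with a non-empty result, else (None, set())."""
    for query in fallback_queries(name):
        ids = lookup(query)
        if ids:
            return query, ids
    return None, set()

print(fallback_queries("Butanoic acid, 3,3-dimethyl-, methyl ester"))
# → ['Butanoic acid, 3,3-dimethyl-, methyl ester', 'Butanoic acid, 3,3-dimethyl-', 'Butanoic acid']
```

A hit on a shortened query refers to the parent compound rather than the ester or derivative itself, so it is worth keeping the returned `query` next to the result and treating such matches as approximate.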
Update
You can also try something more sophisticated.
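Answer 1 (score: 1)
If you really want an R solution, try something like the code below. I really think you need to tidy up your input, particularly the fourth element of the second set. What I have written only works on the first name of each chemical; I will leave the synonyms for you to work on.
Not all chemical names have an entry in the ChemSpider database; you may have more luck. Catching names that have no entry is an important part of the function, and everything breaks without it.
You need to register with ChemSpider to get a token for the API. Registration is free.
The example chemical names you supplied do not seem to match between the two datasets, so df3 below will not contain any matches. I hope this helps.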
library(RCurl)
library(XML)
token <- "your token here" # from chemspider profile
#url <- "http://www.chemspider.com/Search.asmx/AsyncSimpleSearch?query="
url <- "www.chemspider.com/Search.asmx/SimpleSearch?query="
chemCrawl <- function(chemname){ # Query chemspider with chemical names, return ids.
  # df1[13] in particular seems to throw an error. Don't know why.
  chem.id <- tryCatch(xmlValue(xmlRoot(xmlTreeParse(
    getURL(paste(url, "\"", curlEscape(chemname), "\"", "&token=",
                 token, sep = ""))
  ))), error = function(err) {
    "oops"} )
  return(chem.id)
}
df1 <- as.data.frame(structure(c("> <NAME>", "> <NAME>", "> <NAME>", "> <NAME>",
"> <NAME>", "> <NAME>", "> <NAME>", "> <NAME>", "> <NAME>",
"> <NAME>", "> <NAME>", "> <NAME>", "> <NAME>", "> <NAME>",
"> <NAME>", " Benzaldehyde, 4-[(trimethylsilyl)oxy]-", " Benzeneacetic acid, methyl ester",
" Cyclopropaneoctanoic acid, 2-[[2-[(2-ethylcyclopropyl)methyl]cyclopropyl]methyl]-, methyl ester",
" Mevalonic lactone, trimethylsilyl deriv.", " Benzeneacetic acid, phenylmethyl ester",
" Butanoic acid, 3,3-dimethyl-, methyl ester", " Acetic acid, (4-(trifluoromethoxy)phenyl)methyl ester",
" Phosphoramidothioic acid, O,S-dimethyl ester", " Octanoic acid, phenylmethyl ester",
" Benzenepropanoic acid, methyl ester", " 2-Propenoic acid, 3-phenyl-, methyl ester",
" Propanoic acid, 2-methyl-, phenylmethyl ester", " Acetic acid, (2,3-dichlorophenyl)methyl ester",
" L-Methionine, methyl ester", " Butanoic acid, phenylmethyl ester",
"<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>",
"<SYNONYMS>", "> <SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>",
"<SYNONYMS>", "<SYNONYMS>", "> <SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>",
" Benzaldehyde, p-(trimethylsiloxy)-", " Acetic acid, phenyl-, methyl ester",
" Methyl 8-[2-((2-[(2-ethylcyclopropyl)methyl]cyclopropyl)methyl)cyclopropyl]octanoate #",
" Mevalonic lactone, trimethylsilyl", " Acetic acid, phenyl-, benzyl ester",
" Butyric acid, 3,3-dimethyl-, methyl ester", " NA", " Methamidophos",
" Octanoic acid, benzyl ester", " Hydrocinnamic acid, methyl ester",
" Cinnamic acid, methyl ester", " Isobutyric acid, benzyl ester",
" NA", " Methyl 2-amino-4-(methylsulfanyl)butanoate #", " Butyric acid, benzyl ester"
), .Dim = c(15L, 4L), .Dimnames = list(c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15"),
c("NAME", NA, "NA.1", "NA.2"))))
names(df1) <-c("class1", "name", "class2", "synonym")
df1$name <- as.character(df1$name)
df1[1,2] # there are leading spaces
df1$name <- sub(" ", "", df1$name) # lose the leading space
#details of chemspider search api: http://www.chemspider.com/Search.asmx
df1$chem.id <- lapply(df1$name, chemCrawl)
head(df1)
name2 <- structure(c("Name: 1-Phosphatidyl-D-myo-inositol;: 1-Phosphatidyl-1D-myo-inositol;: 1-Phosphatidyl-myo-inositol;: Phosphatidyl-1D-myo-inositol;: (3-Phosphatidyl)-1-D-inositol;: 1,2-Diacyl-sn-glycero-3-phosphoinositol;: Phosphatidylinositol",
"Name: Androstenedione;: Androst-4-ene-3,17-dione;: 4-Androstene-3,17-dione",
"Name: Spermine;: N,N'-Bis(3-aminopropyl)-1,4-butanediamine",
"Name: H+;: Hydron", "Name: 3-Iodo-L-tyrosine", "Name: 3-Methoxytyramine",
"Name: 3-Methoxy-4-hydroxyphenylacetaldehyde;: (4-Hydroxy-3-methoxyphenyl)acetaldehyde;: Homovanillin",
"Name: L-Noradrenaline;: Noradrenaline;: Norepinephrine;: Arterenol;: 4-[(1R)-2-Amino-1-hydroxyethyl]-1,2-benzenediol",
"Name: 3,4-Dihydroxymandelaldehyde;: 3,4-Dihydroxyphenylglycolaldehyde",
"Name: L-Metanephrine", "Name: L-Adrenaline;: (R)-(-)-Adrenaline;: (R)-(-)-Epinephrine;: (R)-(-)-Epirenamine;: (R)-(-)-Adnephrine;: 4-[(1R)-1-Hydroxy-2-(methylamino)ethyl]-1,2-benzenediol",
"Name: 3-Methoxy-4-hydroxyphenylglycolaldehyde", "Name: L-Normetanephrine",
"Name: L-Dopachrome;: 2-L-Carboxy-2,3-dihydroindole-5,6-quinone",
"Name: 5,6-Dihydroxyindole;: DHI"), .Dim = c(15L, 1L))
name2 <- sub("Name: ", "", name2)
name2 <- sub(";.+$", "", name2)
chem.id <- rep(NA, 15)
df2 <- as.data.frame(cbind(name2, chem.id))
names(df2)[1] <- "name2"
df2$chem.id <- lapply(df2$name2, chemCrawl)
head(df2)
df1$chem.id <- as.character(df1$chem.id)
df2$chem.id <- as.character(df2$chem.id)
df3 <- merge(df1, df2, by = "chem.id", all = TRUE)
df3
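One thing to watch in the final merge: every failed lookup carries the same "oops" placeholder, so if both tables contain failures, joining on the id column would pair those rows with each other. The same idea can be sketched in Python, assuming each table is a mapping from name to a single id with `None` marking a failed lookup; `safe_lookup` and `join_on_id` are hypothetical helpers and the ids in the example are invented.

```python
def safe_lookup(lookup, name):
    """Call lookup(name); return None instead of raising when the query fails."""
    try:
        return lookup(name)
    except Exception:
        return None

def join_on_id(ids_a, ids_b):
    """Pair names whose ids are equal, skipping failed lookups (None) on either side."""
    by_id = {}
    for name, cid in ids_b.items():
        if cid is not None:
            by_id.setdefault(cid, []).append(name)
    return [(a, b)
            for a, cid in ids_a.items() if cid is not None
            for b in by_id.get(cid, [])]

# Invented ids: 'b' and 'y' both failed, but they are not reported as a match
print(join_on_id({"a": 1, "b": None}, {"x": 1, "y": None}))
```

Using a missing value that the join explicitly ignores, rather than a shared string, keeps unresolved names out of the match table.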