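I would like to retrieve, from within R, the information associated with a given CAS registry number (Chemical Abstracts Service number) from the NIST Chemistry WebBook website, using the API it provides.
E.g. for CAS no. "19431-79-9" (Caryophylladienol II), http://webbook.nist.gov/cgi/cbook.cgi?ID=19431-79-9&Units=SI&Mask=2000#Gas-Chrom. So far I have: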
casno = "19431-79-9"
casno2 = gsub("-", "", casno)
raw=readLines(paste('http://webbook.nist.gov/cgi/cbook.cgi?ID=',casno,'&Units=SI&Mask=2000#Gas-Chrom', sep=""))
# mass spec, empty here, but not e.g. for casno2="630035"
casno2="630035"
jcampfile = readLines(paste("http://webbook.nist.gov/cgi/cbook.cgi?JCAMP=C",casno2,"&Index=0&Type=Mass",sep=""))
if (jcampfile[[1]]=="##TITLE=Spectrum not found.") jcampfile=NA
casno2 = gsub("-", "", casno)
# molecular structure
molfile2d=readLines(paste("http://webbook.nist.gov/cgi/cbook.cgi?Str2File=C",casno2,sep=""))
# test the length: molfile2d==character(0) gives logical(0), which if() cannot use
if (length(molfile2d)==0) molfile2d=NA
molfile3d=readLines(paste("http://webbook.nist.gov/cgi/cbook.cgi?Str3File=C",casno2,sep=""))
if (length(molfile3d)==0) molfile3d=NA
From the following bits of the raw output, I would like to extract the following variables and lists:
"name=\" Top \">Caryophylladienol II</a></h1>"
-> name="Caryophylladienol II"
"Formula</a>:</strong> C<sub>15</sub>H<sub>24</sub>O</li>\n \n \n<li><strong>"
-> formula="C15H24O"
"Molecular weight</a>:</strong> 220.3505</li>\n \n \n<li>"
-> MW=220.3505
"IUPAC Standard InChI:</strong>\n \n<br /><table>\n<tr><td>\n<ul style=\" list-style-type: circle;\">\n<li><tt>InChI=1S/C15H24O/c1-10-6-8-14(16)11(2)5-7-13-12(10)9-15(13,3)4/h12-14,16H,1-2,5-9H2,3-4H3/t12?,13?,14-/m1/s1</tt></li>\n"
-> InChI="InChI=1S/C15H24O/c1-10-6-8-14(16)11(2)5-7-13-12(10)9-15(13,3)4/h12-14,16H,1-2,5-9H2,3-4H3/t12?,13?,14-/m1/s1"
"IUPAC Standard InChIKey:</strong>\n<tt>CIIYOYPOMGIECX-JXQTWKCFSA-N</tt>"
-> InChiKey="CIIYOYPOMGIECX-JXQTWKCFSA-N"
"Stereoisomers:....<strong>
-> stereoisomers=XXX (list of stereoisomers)
"Other names:...\n"
-> synonyms=XXX (list of synonyms)
"Normal alkane RI..."
-> list of measured RIs plus on which column they were measured
e.g. here RIs=c(1637,1631,1627,1656,1615,1638,1628,1602,1611,1635,1622,1622,1627); columns=c("HP-5 MS","DB-5","RTX-1","Col-Elite 5MS","DB-5","DB-5","DB-5","DB-1","DB-5","CP Sil 5 CB","BP-1","RTX-1","DB-5")
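Any ideas on how best to do this parsing? Ideally it would all be wrapped in one function that takes a list of CAS numbers as input, annotates them with the information from the NIST WebBook, and writes the result to a text file. It does not need to be that polished, though: anything that gets me started would help!
EDIT: I have been trying to parse the HTML file with htmlTreeParse from the XML package, but I have not succeeded yet. Could anyone with more experience with this function help me out?
EDIT: I have since found a solution for importing the data in Mathematica, see https://mathematica.stackexchange.com/questions/37091/look-up-info-associated-with-a-given-cas-chemical-identifier-from-the-nist-webbo. If anyone is willing to port that code to R, please let me know!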
Answer 0 (score: 2)
For the first URL string in the question, try
library(XML)  # provides htmlParse, xpathSApply and readHTMLTable
casno = "19431-79-9"
url <- paste('http://webbook.nist.gov/cgi/cbook.cgi?ID=',casno,'&Units=SI&Mask=2000#Gas-Chrom', sep="")
doc <- htmlParse(url)
name <- xpathSApply(doc, "//a[@id='Top']", xmlValue)
name
[1] "Caryophylladienol II"
Grab all list items that have a bold heading (some output is truncated for display)
x <- xpathSApply(doc, "//li/strong/..", xmlValue)
x
[1] "Formula: C15H24O"
[2] "Molecular weight: 220.3505"
[3] "IUPAC Standard InChI:\n\n\nInChI=1S/C15H24O/c1-10-6-8-14(16)11(2)5-7-13-12(10)9-15(13,3)4/h12-14,16H,1-2,5-9H2, ...
[4] "IUPAC Standard InChIKey:\nCIIYOYPOMGIECX-JXQTWKCFSA-N"
[5] "CAS Registry Number: 19431-79-9"
[6] "Chemical structure: \nThis structure is also available as a 2d Mol file\n
[7] "Species with the same structure:\nCaryophylla-4(14), 8(15)-dien-5-ol\n\n"
[8] "Stereoisomers:\nCaryophylladienol I\nCaryophylla-3(15),7(14)-dien-6-ol\n«alpha»-Caryophylladienol\nExo methylene ...
[9] "Other names:\nCaryophylla-4(14),8(15)-dien-5«alpha»-ol;\nCaryophylla-2(12),6(13)-dien-5-«alpha»-ol;\nCaryophylla ...
[10] "Information on this page:\nGas Chromatography\nReferences\nNotes / Error Report\n\n"
[11] "Options:\nSwitch to calorie-based units\n\n"
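If you are only writing to a file, you can fix the delimited list in element 8 (replace the newlines with semicolons) and remove the remaining newlines.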
x <- gsub(":\n", ": ", x)
x[8] <- gsub("\n+", ";", x[8])
x <- gsub("\n", "", x)
x <- gsub("Download the identifier in a file.", "", x, fixed=TRUE)  # fixed=TRUE so "." is a literal period
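To see what these substitutions do without fetching the page, here is a minimal base R sketch on a made-up three-element vector modelled on the output above. The vector and the index 3 are assumptions for illustration; on the real page the stereoisomer list sits at index 8. It also trims trailing newlines first, so the joined list does not end in a stray semicolon.

```r
# Toy input modelled on the list items shown above (abbreviated, not real scraper output)
x <- c("Formula: C15H24O",
       "IUPAC Standard InChIKey:\nCIIYOYPOMGIECX-JXQTWKCFSA-N",
       "Stereoisomers:\nCaryophylladienol I\nCaryophylla-3(15),7(14)-dien-6-ol\n\n")

x <- gsub(":\n", ": ", x)        # join each heading to its first value
x[3] <- sub("\n+$", "", x[3])    # drop trailing newlines so no stray ";" is left
x[3] <- gsub("\n+", ";", x[3])   # remaining newlines separate the list entries
x <- gsub("\n", "", x)           # strip any newline left in the other elements
print(x)
```

Each element is now a single line, which is what write() needs to produce one record per line.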
Use readHTMLTable for the tables
y <-readHTMLTable(doc, stringsAsFactors=FALSE)
Then count the rows to find the right table and get the values
sapply(y, nrow)
NULL NULL NULL NULL NULL NULL
1 1 5 13 6 1
y[[4]][,2:3]
Active phase I
1 HP-5 MS 1637.
2 DB-5 MS 1631.
3 RTX-1 1627.
4 Col-Elite 5MS 1656.
5 DB-5 1615.
...
ri <- paste0(gsub(".", "", y[[4]][,3], fixed=TRUE), "=", y[[4]][,2], collapse=";")
ri
[1] "1637=HP-5 MS;1631=DB-5 MS;1627=RTX-1;1656=Col-Elite 5MS;1615=DB-5;1638=DB-5;1628=DB-5;1602=DB-1;1611=DB-5;1635=CP Sil 5 CB;1622=BP-1;1622=RTX-1;1627=DB-5"
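If you would rather have the retention indices and column names as two separate vectors, as asked for in the question, the same table columns can be converted directly. The sketch below is base R only and uses a small hand-built data frame as a stand-in for y[[4]][,2:3]; the three rows are copied from the output above, and the name ri_table is made up for illustration.

```r
# Toy stand-in for y[[4]][,2:3]; values copied from the first rows shown above
ri_table <- data.frame(
  `Active phase` = c("HP-5 MS", "DB-5 MS", "RTX-1"),
  I = c("1637.", "1631.", "1627."),
  check.names = FALSE, stringsAsFactors = FALSE
)

columns <- ri_table[["Active phase"]]   # stationary phases as a character vector
RIs <- as.numeric(ri_table[["I"]])      # the trailing "." is accepted by as.numeric

# Same "RI=column" string as above, built from the two vectors
ri <- paste0(RIs, "=", columns, collapse = ";")
print(ri)
```

Note that the table position (4) was found by inspecting the row counts for this particular page, so it may well differ for other compounds.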
Finally, combine and write to a file
cas <- c(paste("Name:", name), x[c(1:5,7:9)], paste("RI:", ri) )
write( cas, file="cas.out")
There are other ways to get the values in the unordered lists, for example, to get all the stereoisomers as a vector...
stereo <- xpathSApply(doc, "//li/strong[text()='Stereoisomers:']/../ul/li/a", xmlValue)
[1] "Caryophylladienol I" "Caryophylla-3(15),7(14)-dien-6-ol" "«alpha»-Caryophylladienol"
[4] "Exo methylene isomer of Caryophyllenol I" "«beta»-Caryophylla-4(14),8(15)-dien-5-ol" "Caryophylla-4(12),8(13)-dien-5-«beta»-ol"
[7] "Caryophylla-4,8-dien-5-ol" "Caryophylla-4(12),8(13) diene 5 «beta»-ol" "Caryophyla-4(14),8(15)-dien-5-ol"
[10] "Caryophylla-4(12).8(13)-diene-5«beta»-ol" "2(12),6(13)-Caryophylladien-5-ol"
Then write them to the file as multiple lines.
paste("Stereoisomer:", stereo)
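The paste() call only builds the lines, so they still have to be written out. A minimal sketch of that last step, assuming a shortened stereo vector and a temporary file in place of "cas.out" (both are placeholders for illustration):

```r
# Placeholder values; in practice these come from the xpathSApply calls above
name <- "Caryophylladienol II"
stereo <- c("Caryophylladienol I", "Caryophylla-3(15),7(14)-dien-6-ol")

out <- tempfile(fileext = ".txt")   # hypothetical path; use "cas.out" to match the answer
lines <- c(paste("Name:", name), paste("Stereoisomer:", stereo))
writeLines(lines, out)              # one record per line

print(readLines(out))
```

Because paste() is vectorised, every stereoisomer gets its own "Stereoisomer:" line without an explicit loop.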