I am trying to scrape this table from this website using RCurl. I was able to get it into a nice data frame with the following code:
library(RCurl)
library(XML)
clinVar <- getURL("http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1")
docForm2 <- htmlTreeParse(clinVar, useInternalNodes = TRUE)
xp_expr <- "//table[@class='jig-ncbigrid docsum_table']/tbody/tr"
nodes <- getNodeSet(docForm2, xp_expr)
extractedData <- xmlToDataFrame(nodes)
colnames(extractedData) <- c("Info", "Gene", "Variation", "Freq", "Phenotype", "Clinical significance", "Status", "Chr", "Location")
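However, I can only extract the data on the first page, and the table spans multiple pages. How do you access the data on the next page? I looked at the site's HTML and found the area where the "Next" button is (I believe!):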
<a name="EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.Page" title="Next page of results" class="active page_link next" href="#" sid="3" page="3" accesskey="k" id="EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.Page">Next ></a>
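I would like to know how to reach this link using getURL, postForm, etc. I think I should do something like the following to get the data from the second page, but it still just gives me the first page: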
url <- "http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1"
clinVar <- postForm(url,
  "EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.cPage" = "2")
docForm2 <- htmlTreeParse(clinVar, useInternalNodes = TRUE)
xp_expr <- "//table[@class='jig-ncbigrid docsum_table']/tbody/tr"
nodes <- getNodeSet(docForm2, xp_expr)
extractedData <- xmlToDataFrame(nodes)
colnames(extractedData) <- c("Info", "Gene", "Variation", "Freq", "Phenotype", "Clinical significance", "Status", "Chr", "Location")
Thanks to anyone who can help.
Answer 0 (score: 3)
I would use the E-utilities to access data at NCBI.
url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=brca1"
readLines(url)
[1] "<?xml version=\"1.0\" ?>"
[2] "<!DOCTYPE eSearchResult PUBLIC \"-//NLM//DTD eSearchResult, 11 May 2002//EN\" \"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd\">"
[3] "<eSearchResult><Count>1080</Count><RetMax>20</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>NCID_1_36649974_130.14.18.34_9001_1386348760_356908530</WebEnv><IdList>"
Pass the QueryKey and WebEnv to esummary to get the XML summaries (these values change with every esearch, so copy and paste the new keys into the URL below).
url2 <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&query_key=1&WebEnv=NCID_1_36649974_130.14.18.34_9001_1386348760_356908530"
brca1 <- xmlParse(url2)
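Instead of copying the keys by hand, they can be read out of the esearch response. This is only a minimal sketch: `build_esummary_url` is a hypothetical helper name, and `esearch_xml` is a shortened, made-up response (including the `NCID_EXAMPLE` value) used so the snippet runs without a network call. The E-utilities documentation describes `usehistory=y` as the parameter that makes esearch return QueryKey and WebEnv, so you would probably want to add it to the esearch URL.

```r
library(XML)

# Illustrative esearch response (shortened, not real NCBI output).
esearch_xml <- paste0(
  "<eSearchResult><Count>1080</Count><RetMax>20</RetMax><RetStart>0</RetStart>",
  "<QueryKey>1</QueryKey><WebEnv>NCID_EXAMPLE</WebEnv>",
  "<IdList><Id>1</Id></IdList></eSearchResult>"
)

# Hypothetical helper: pull QueryKey and WebEnv out of the esearch XML text
# and build the matching esummary URL.
build_esummary_url <- function(esearch_text, db = "clinvar") {
  doc <- xmlParse(esearch_text, asText = TRUE)
  query_key <- xpathSApply(doc, "//QueryKey", xmlValue)
  web_env <- xpathSApply(doc, "//WebEnv", xmlValue)
  paste0(
    "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=", db,
    "&query_key=", query_key,
    "&WebEnv=", web_env
  )
}

url2 <- build_esummary_url(esearch_xml)
print(url2)
```

Against the live service, the text passed in would be something like `paste(readLines(url), collapse = "")`.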
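Next, look at a single record and then extract the fields you need. If a tag can hold zero to many values, you may need to loop through the node set. Others, like the clinical significance description, always have exactly one value.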
getNodeSet(brca1, "//DocumentSummary")[[1]]
table(xpathSApply(brca1, "//clinical_significance/description", xmlValue) )
                          Benign conflicting data from submitters                     not provided                            other
                             129                               22                                6                                1
                      Pathogenic          probably not pathogenic              probably pathogenic                      risk factor
                             508                               68                               19                               43
          Uncertain significance
                             284
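To get one row per record when some tags have zero to many values, one option is to loop over the DocumentSummary nodes and collapse each multi-valued field into a single string. The sketch below runs on a tiny made-up document, not real ClinVar output. The `trait_set/trait/trait_name` path and the `uid` attribute are assumptions about the esummary layout, and `collapse_values` is a hypothetical helper.

```r
library(XML)

# Made-up sample in the shape of an esummary result (for illustration only).
sample_xml <- paste0(
  "<eSummaryResult><DocumentSummarySet>",
  "<DocumentSummary uid=\"1\">",
  "<clinical_significance><description>Pathogenic</description></clinical_significance>",
  "<trait_set><trait><trait_name>A</trait_name></trait>",
  "<trait><trait_name>B</trait_name></trait></trait_set>",
  "</DocumentSummary>",
  "<DocumentSummary uid=\"2\">",
  "<clinical_significance><description>Benign</description></clinical_significance>",
  "<trait_set/>",
  "</DocumentSummary>",
  "</DocumentSummarySet></eSummaryResult>"
)

doc <- xmlParse(sample_xml, asText = TRUE)
records <- getNodeSet(doc, "//DocumentSummary")

# Hypothetical helper: collect every value under `path` (relative to one
# record) and join them; a record with no matches yields an empty string.
collapse_values <- function(node, path) {
  values <- unlist(xpathSApply(node, path, xmlValue))
  paste(values, collapse = "; ")
}

summary_df <- data.frame(
  uid = vapply(records, xmlGetAttr, character(1), name = "uid"),
  significance = vapply(records, collapse_values, character(1),
                        path = "./clinical_significance/description"),
  traits = vapply(records, collapse_values, character(1),
                  path = "./trait_set/trait/trait_name"),
  stringsAsFactors = FALSE
)
print(summary_df)
```

The paths start with `./` so that each lookup stays inside the current record. A path starting with `//` would search the whole document again.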
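In addition, there are a number of packages on GitHub and BioC that wrap the E-utilities (rentrez, reutils, genomes, etc.). Using the genomes package on BioC, this simplifies to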
brca1 <- esummary( esearch("brca1", db="clinvar"), parse=FALSE )
Answer 1 (score: 0)
Use the E-utilities provided for the NCBI databases; see http://www.ncbi.nlm.nih.gov/books/NBK25500/ for details.
## Use the eSearch feature of the E-utilities to search NCBI for the IDs corresponding to each row of data.
## To see all IDs, not just the first 20 (the default retmax), set retmax to a high number.
## To get the query key and WebEnv, set usehistory=y.
library(RCurl)
library(XML)
baseSearch <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=" ## eSearch
db <- "clinvar" ## database to query
gene <- "BRCA1" ## gene of interest
query <- paste('[gene]+AND+"','clinsig pathogenic"','[Properties]+AND+"','single nucleotide variant"','[Type of variation]&usehistory=y&retmax=1110',sep="") ## query, see below for details
baseFetch <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=" ## base fetch
searchURL <- paste(baseSearch, db, "&term=", gene, query, sep = "")
getSearch <- getURL(searchURL)
searchHTML <- htmlTreeParse(getSearch, asText = TRUE, useInternalNodes = TRUE) ## parse the downloaded text instead of fetching the URL a second time
## The response contains <QueryKey> and <WebEnv>; the names are lowercase here because the HTML parser lowercases tag names.
querykey <- xpathSApply(searchHTML, "//querykey", xmlValue)
webenv <- xpathSApply(searchHTML, "//webenv", xmlValue)
fetchURL <- paste(baseFetch, db, "&query_key=", querykey, "&WebEnv=", webenv, "&rettype=docsum", sep = "")
getFetch <- getURL(fetchURL)
fetchHTML <- htmlTreeParse(getFetch, asText = TRUE, useInternalNodes = TRUE)
nodes <- getNodeSet(fetchHTML, "//position")
extractedDataAll <- xmlToDataFrame(nodes)
colnames(extractedDataAll) <- c("pathogenicSNPs")
print(extractedDataAll)
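Note that I found the query terms by going to http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1, selecting my filters (pathogenic, etc.), and then clicking the Advanced button. The most recently applied filters should appear in the main box, and that is what I used for the query.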
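As an alternative to one very large retmax, the E-utilities also accept a retstart parameter, so the results can be requested page by page, which is closer to what the original question asked for. The sketch below only builds the page URLs and makes no network calls. `page_starts` and `esearch_page_urls` are hypothetical helper names, and the count would normally come from the `<Count>` element of a first esearch response.

```r
# Hypothetical helper: zero-based offsets of each page.
page_starts <- function(count, page_size = 20) {
  if (count <= 0) return(integer(0))
  as.integer(seq(0, count - 1, by = page_size))
}

# Hypothetical helper: one esearch URL per page, using retstart and retmax.
esearch_page_urls <- function(term, count, page_size = 20, db = "clinvar") {
  starts <- page_starts(count, page_size)
  if (length(starts) == 0) return(character(0))
  paste0(
    "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    "?db=", db,
    "&term=", term,
    # sprintf("%d") avoids scientific notation for large offsets
    "&retstart=", sprintf("%d", starts),
    "&retmax=", sprintf("%d", as.integer(page_size))
  )
}

urls <- esearch_page_urls("brca1", count = 1080, page_size = 500)
print(urls)
```

Each URL could then be fetched in turn with getURL and the returned ID lists combined.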
Answer 2 (score: 0)
ClinVar now provides XML downloads of the entire database, so web scraping is no longer necessary.