I am working with the DrugBank database and need help extracting specific text from the following HTML:
<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>
I would like to get the following text as output, as a list object:
B01AC05
B01AC — Platelet aggregation inhibitors excl. heparin
B01A — ANTITHROMBOTIC AGENTS
B01 — ANTITHROMBOTIC AGENTS
B — BLOOD AND BLOOD FORMING ORGANS
I tried the following function, but it does not work:
library(XML)
getATC <- function(id){
url <- "http://www.drugbank.ca/drugs/"
dburl <- paste(url, id, sep ="")
tables <- readHTMLTable(dburl, header = F)
table <- tables[['atc-drug-tree']]
table
}
ids <- c("DB00208", "DB00209")
ref <- apply(ids, 1, getATC)
NB: The URL can be used to view the actual page I want to parse; the HTML snippet above is only an example.
Thanks
Answer 0 (score: 3)
rvest makes web scraping very simple. Here is a solution that uses it.
library("rvest")
library("stringr")
your_html <- read_html('<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>')
your_name <-
your_html %>%
html_nodes(xpath='//th[contains(text(), "ATC Codes")]/following-sibling::td') %>%
html_text() %>%
str_extract(".+(?=\n)")
list_elements <-
your_html %>% html_nodes("li") %>% html_nodes("a") %>% html_text()
your_list <- list()
your_list[[your_name]] <- list_elements
> your_list
$B01AC05
[1] "B01AC — Platelet aggregation inhibitors excl. heparin"
[2] "B01A — ANTITHROMBOTIC AGENTS"
[3] "B01 — ANTITHROMBOTIC AGENTS"
[4] "B — BLOOD AND BLOOD FORMING ORGANS"
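The str_extract(".+(?=\n)") step keeps only the first line of the cell text, that is, the code that precedes the nested list. As a rough sketch of the same idea in base R, assuming a hypothetical string cell_text that stands in for what html_text() returns (the labels are shortened and nothing is fetched from DrugBank):

```r
# Hypothetical stand-in for the text of the <td> cell (an assumption, not real page output).
cell_text <- "B01AC05\n\nB01AC Platelet aggregation inhibitors\nB01A ANTITHROMBOTIC AGENTS\n"

# Same lookahead pattern as above: one or more non-newline characters
# that are immediately followed by a newline. regexpr() returns the first match only.
first_line <- regmatches(cell_text, regexpr(".+(?=\n)", cell_text, perl = TRUE))

# Alternative without a regex: split on newlines and keep the first piece.
first_piece <- strsplit(cell_text, "\n", fixed = TRUE)[[1]][1]

print(first_line)
print(first_piece)
```

Both variants rely on the code being followed by a newline before the list starts, which holds for the snippet in the question but may not hold for other markup.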
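Answer 1 (score: 2)
Create the URL strings and use sapply to run the getDrugs function on each of them. getDrugs parses the HTML, takes the root of the HTML tree, finds each ul node with the specified class, and returns the text of its parent (but only up to the first whitespace) followed by the text of each ./li/a grandchild: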
library(XML)
getDrugs <- function(...) {
doc <- htmlTreeParse(..., useInternalNodes = TRUE)
xpathApply(xmlRoot(doc), "//ul[@class='atc-drug-tree']", function(node) {
c(sub("\\s.*", "", xmlValue(xmlParent(node))), # get text before 1st whitespace
xpathSApply(node, "./li/a", xmlValue)) # get text in each ./li/a node
})
}
ids <- c("DB00208", "DB00209")
urls <- paste0("http://www.drugbank.ca/drugs/", ids)
L <- sapply(urls, getDrugs)
This gives the following list (one component per URL and, within it, one component for each drug tree found at that URL):
> L
$`http://www.drugbank.ca/drugs/DB00208`
$`http://www.drugbank.ca/drugs/DB00208`[[1]]
[1] "B01AC05B01AC"
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"
[4] "B01 — ANTITHROMBOTIC AGENTS"
[5] "B — BLOOD AND BLOOD FORMING ORGANS"
$`http://www.drugbank.ca/drugs/DB00209`
$`http://www.drugbank.ca/drugs/DB00209`[[1]]
[1] "A03DA06A03DA"
[2] "A03DA — Synthetic anticholinergic agents in combination with analgesics"
[3] "A03D — ANTISPASMODICS IN COMBINATION WITH ANALGESICS"
[4] "A03 — DRUGS FOR FUNCTIONAL GASTROINTESTINAL DISORDERS"
[5] "A — ALIMENTARY TRACT AND METABOLISM"
$`http://www.drugbank.ca/drugs/DB00209`[[2]]
[1] "A03DA06A03DA"
[2] "G04BD — Drugs for urinary frequency and incontinence"
[3] "G04B — UROLOGICALS"
[4] "G04 — UROLOGICALS"
[5] "G — GENITO URINARY SYSTEM AND SEX HORMONES"
We can turn this into a 5x3 matrix like this:
simplify2array(do.call(c, L))
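To illustrate what that line does, here is a small sketch on a hand-built stand-in for L. The object L_demo and its shortened values are assumptions made for illustration, with two entries per tree instead of five, so the result is 2x3 rather than 5x3:

```r
# Hand-built stand-in for L (hypothetical, truncated values; not real output).
L_demo <- list(
  DB00208 = list(c("B01AC05", "B01AC")),
  DB00209 = list(c("A03DA06", "A03DA"), c("A03DA06", "G04BD"))
)

# do.call(c, ...) removes one level of nesting: the result is a flat list
# holding one character vector per tree.
flat <- do.call(c, L_demo)

# simplify2array() binds equal-length vectors into a matrix, one column per tree.
m <- simplify2array(flat)

print(length(flat))
print(dim(m))
print(m)
```

This only works as a matrix when every tree has the same number of entries; with unequal lengths simplify2array returns the list unchanged.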
Here is a test using the input from the question:
Lines <- '<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>'
getDrugs(Lines, asText = TRUE)
giving:
[[1]]
[1] "B01AC05"
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"
[4] "B01 — ANTITHROMBOTIC AGENTS"
[5] "B — BLOOD AND BLOOD FORMING ORGANS"
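Note that the live-page output above starts with "B01AC05B01AC" while the test output starts with "B01AC05". One plausible explanation, which is an assumption because the live markup is not shown here, is that the live page has no whitespace between the code and the first list label, so cutting at the first whitespace keeps both. The sketch below reproduces that behaviour on two hypothetical strings and shows a pattern-based alternative that assumes the cell starts with a level-5 ATC code (one letter, two digits, two letters, two digits):

```r
# Hypothetical cell texts (assumptions for illustration, labels simplified).
with_space    <- "B01AC05\nB01AC - Platelet aggregation inhibitors"
without_space <- "B01AC05B01AC - Platelet aggregation inhibitors"

# The approach used in getDrugs: drop everything from the first whitespace on.
code_a <- sub("\\s.*", "", with_space)
code_b <- sub("\\s.*", "", without_space)

# Alternative: match the shape of a level-5 ATC code at the start of the text.
atc_pattern <- "^[A-Z][0-9]{2}[A-Z]{2}[0-9]{2}"
code_c <- regmatches(without_space, regexpr(atc_pattern, without_space))

print(code_a)
print(code_b)
print(code_c)
```

The pattern-based version does not depend on whitespace, but it would miss cells that hold a shorter, higher-level code.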
Answer 2 (score: 0)
readHTMLTable fails because it cannot read the headers in tables 3 and 4.
url <- "http://www.drugbank.ca/drugs/DB00208"
doc <- htmlParse(readLines(url))
summary(doc)
$nameCounts
td a tr li th span div p strong img table ...
745 399 342 175 159 137 66 49 46 27 27
# these calls raise errors
readHTMLTable(doc)
readHTMLTable(doc, which=3)
# this works
readHTMLTable(doc, which=3, header=FALSE)
Also, the ATC codes are not inside a nearby table tag, so you have to use xpath as in the other answers.
xpathSApply(doc, '//ul[@class="atc-drug-tree"]/*', xmlValue)
[1] "B01AC — Platelet aggregation inhibitors excl. heparin" "B01A — ANTITHROMBOTIC AGENTS"
[3] "B01 — ANTITHROMBOTIC AGENTS" "B — BLOOD AND BLOOD FORMING ORGANS"
xpathSApply(doc, '//ul[@class="atc-drug-tree"]/../node()[1]', xmlValue)
[1] "B01AC05"
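Separately from the table parsing, the apply(ids, 1, getATC) call in the question would fail on its own, because apply expects an object with dimensions (a matrix or array) and ids is a plain character vector. A minimal sketch of the difference, using a hypothetical helper build_url in place of getATC so that nothing is downloaded:

```r
ids <- c("DB00208", "DB00209")

# Hypothetical stand-in for getATC(); it only builds the URL and fetches nothing.
build_url <- function(id) paste0("http://www.drugbank.ca/drugs/", id)

# apply() needs an object with a dim attribute, so a plain vector raises an error.
res <- tryCatch(apply(ids, 1, build_url), error = function(e) e)
print(inherits(res, "error"))

# lapply() iterates over the elements of a vector and returns a list.
urls <- lapply(ids, build_url)
names(urls) <- ids
print(urls)
```

lapply always returns a list, which suits the list output asked for in the question; sapply would try to simplify the result when it can.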