Extracting text from an HTML page in R

Asked: 2016-03-14 12:02:10

Tags: html xml r

I am working with the DrugBank database and need help extracting specific text from the HTML code below:

<table>
<tr>
    <td>Text</td>
</tr>
<tr>
    <th>ATC Codes</th>
    <td>B01AC05
        <ul class="atc-drug-tree">
            <li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
            <li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
            <li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
            <li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
        </ul>
    </td>
</tr>
<tr>
    <td>Text</td>
</tr>
</table>

I would like the following text as output, as a list object:

B01AC05
B01AC — Platelet aggregation inhibitors excl. heparin
B01A — ANTITHROMBOTIC AGENTS
B01 — ANTITHROMBOTIC AGENTS
B — BLOOD AND BLOOD FORMING ORGANS

I tried the following function, but it does not work:

library(XML)

getATC <- function(id){
    url    <- "http://www.drugbank.ca/drugs/"
    dburl  <- paste(url, id, sep ="")
    tables <- readHTMLTable(dburl, header = F)
    table  <- tables[['atc-drug-tree']]
    table
}

ids  <- c("DB00208", "DB00209")
ref  <- apply(ids, 1, getATC)

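NB: The URL can be used to view the actual page I want to parse; the HTML snippet I provided is only an example.

Thanks

3 answers:

Answer 0 (score: 3)

rvest makes web scraping very simple. Here is a solution that uses it.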

library("rvest")
library("stringr")
your_html <- read_html('<table>
<tr>
          <td>Text</td>
          </tr>
          <tr>
          <th>ATC Codes</th>
          <td>B01AC05
          <ul class="atc-drug-tree">
          <li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
          <li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
          <li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
          <li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
          </ul>
          </td>
          </tr>
          <tr>
          <td>Text</td>
          </tr>
          </table>')
your_name <- 
  your_html %>% 
  html_nodes(xpath='//th[contains(text(), "ATC Codes")]/following-sibling::td') %>%
  html_text() %>%
  str_extract(".+(?=\n)")
list_elements <- 
  your_html %>%  html_nodes("li") %>% html_nodes("a") %>% html_text()
your_list <- list()
your_list[[your_name]] <- list_elements
> your_list
$B01AC05
[1] "B01AC — Platelet aggregation inhibitors excl. heparin"
[2] "B01A — ANTITHROMBOTIC AGENTS"                         
[3] "B01 — ANTITHROMBOTIC AGENTS"                          
[4] "B — BLOOD AND BLOOD FORMING ORGANS"        
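
The code above selects every "li" in the document, which works for the snippet but would presumably pick up unrelated list items on the full page. As a minimal sketch, assuming the ATC cell always follows a th containing "ATC Codes", the selection can be scoped to that cell. The HTML here is a shortened copy of the question's snippet, and the names page, cell, atc_code and atc_levels are illustrative only:

```r
library(rvest)

# Shortened copy of the snippet from the question (assumption: same structure as the real page)
html <- '<table>
<tr>
    <th>ATC Codes</th>
    <td>B01AC05
        <ul class="atc-drug-tree">
            <li><a href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
            <li><a href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
        </ul>
    </td>
</tr>
</table>'

page <- read_html(html)

# The <td> that follows the "ATC Codes" header cell
cell <- html_node(page, xpath = '//th[contains(text(), "ATC Codes")]/following-sibling::td')

# The first run of non-whitespace characters in the cell text is the ATC code
cell_text <- html_text(cell)
atc_code <- regmatches(cell_text, regexpr("\\S+", cell_text))

# Only links inside the ATC tree of this cell, not every <li> in the document
atc_levels <- html_text(html_nodes(cell, "ul.atc-drug-tree li a"))

# Code first, then one entry per ATC level
atc <- c(atc_code, atc_levels)
```

Restricting the search to cell keeps the result independent of whatever other lists the page contains.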

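Answer 1 (score: 2)

Build the URL strings and sapply getDrugs over them. getDrugs parses the HTML, extracts the root of the HTML tree, finds the ul nodes with the indicated class, and for each one returns the text of its parent (only the part before the first whitespace) followed by the text in each ./li/a grandchild: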

library(XML)

getDrugs <- function(...) {
   doc <- htmlTreeParse(..., useInternalNodes = TRUE)
   xpathApply(xmlRoot(doc), "//ul[@class='atc-drug-tree']", function(node) {
     c(sub("\\s.*", "", xmlValue(xmlParent(node))), # get text before 1st whitespace
     xpathSApply(node, "./li/a", xmlValue)) # get text in each ./li/a node
   })
}


ids  <- c("DB00208", "DB00209")
urls <- paste0("http://www.drugbank.ca/drugs/", ids)
L <- sapply(urls, getDrugs)

This gives the following list (one component per URL, and within each of those one component for each drug in that URL):

> L
$`http://www.drugbank.ca/drugs/DB00208`
$`http://www.drugbank.ca/drugs/DB00208`[[1]]
[1] "B01AC05B01AC"                                         
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"                         
[4] "B01 — ANTITHROMBOTIC AGENTS"                          
[5] "B — BLOOD AND BLOOD FORMING ORGANS"                   


$`http://www.drugbank.ca/drugs/DB00209`
$`http://www.drugbank.ca/drugs/DB00209`[[1]]
[1] "A03DA06A03DA"                                                           
[2] "A03DA — Synthetic anticholinergic agents in combination with analgesics"
[3] "A03D — ANTISPASMODICS IN COMBINATION WITH ANALGESICS"                   
[4] "A03 — DRUGS FOR FUNCTIONAL GASTROINTESTINAL DISORDERS"                  
[5] "A — ALIMENTARY TRACT AND METABOLISM"                                    

$`http://www.drugbank.ca/drugs/DB00209`[[2]]
[1] "A03DA06A03DA"                                        
[2] "G04BD — Drugs for urinary frequency and incontinence"
[3] "G04B — UROLOGICALS"                                  
[4] "G04 — UROLOGICALS"                                   
[5] "G — GENITO URINARY SYSTEM AND SEX HORMONES"          

We can create a 5x3 matrix like this:

simplify2array(do.call(c, L))
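
To make the flattening step easier to follow, here is a small offline sketch of the same call on a toy nested list. The object nested and its contents are made up for illustration; only the shape (a list per URL holding one character vector per tree) mirrors L:

```r
# Toy stand-in for L: two "URLs", the second with two trees (made-up values)
nested <- list(
  u1 = list(c("a1", "a2")),
  u2 = list(c("b1", "b2"), c("c1", "c2"))
)

# do.call(c, ...) concatenates the inner lists into one flat list of vectors
flat <- do.call(c, nested)

# Equal-length vectors become the columns of a character matrix
m <- simplify2array(flat)
```

Here flat has three components and m is a 2x3 matrix, one column per tree, which is the same mechanism that yields five rows and three columns for L.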

Here is a test using the input from the question:

Lines <- '<table>
<tr>
    <td>Text</td>
</tr>
<tr>
    <th>ATC Codes</th>
    <td>B01AC05
        <ul class="atc-drug-tree">
            <li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
            <li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
            <li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
            <li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
        </ul>
    </td>
</tr>
<tr>
    <td>Text</td>
</tr>
</table>'

getDrugs(Lines, asText = TRUE)

giving:

[[1]]
[1] "B01AC05"                                               
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"                         
[4] "B01 — ANTITHROMBOTIC AGENTS"                          
[5] "B — BLOOD AND BLOOD FORMING ORGANS"    

Answer 2 (score: 0)

readHTMLTable does not work because it cannot read the headers in tables 3 and 4.

url <- "http://www.drugbank.ca/drugs/DB00208"
doc <- htmlParse(readLines(url))
summary(doc)
$nameCounts

      td        a       tr       li       th     span      div        p   strong      img    table ...
     745      399      342      175      159      137       66       49       46       27       27  

#errors
readHTMLTable(doc)
readHTMLTable(doc, which=3)   
# this works
readHTMLTable(doc, which=3, header=FALSE)

Also, the ATC codes are not inside a nearby table tag, so you have to use xpath as in the other answers.

xpathSApply(doc, '//ul[@class="atc-drug-tree"]/*', xmlValue)
[1] "B01AC — Platelet aggregation inhibitors excl. heparin" "B01A — ANTITHROMBOTIC AGENTS"                         
[3] "B01 — ANTITHROMBOTIC AGENTS"                           "B — BLOOD AND BLOOD FORMING ORGANS"     

xpathSApply(doc, '//ul[@class="atc-drug-tree"]/../node()[1]', xmlValue)
[1] "B01AC05"
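
Separately from the table parsing, the question's apply(ids, 1, getATC) call fails on its own, because apply() expects an object with dimensions and ids is a plain character vector. A minimal offline sketch of the difference, using a hypothetical stand-in function build_url in place of getATC so that nothing is downloaded:

```r
ids <- c("DB00208", "DB00209")

# Hypothetical stand-in for getATC so the example runs without network access
build_url <- function(id) paste0("http://www.drugbank.ca/drugs/", id)

# apply() needs an array or matrix; on a plain vector it raises an error,
# which is captured here as a message string instead of stopping the script
apply_result <- tryCatch(apply(ids, 1, build_url),
                         error = function(e) conditionMessage(e))

# lapply() iterates over the vector elements and always returns a list
urls <- lapply(ids, build_url)
names(urls) <- ids
```

With a real extraction function in place of build_url, lapply(ids, ...) would return one list component per drug ID.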