Replacing missing html_nodes with NA when using rvest in R

Asked: 2017-06-02 19:40:29

Tags: html r html-parsing rvest

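I am trying to parse an HTML document in which each div contains several li elements. The HTML below is just a sample I saved with two divs; I have nearly 7,000 divs to parse. Not every div contains all of the li elements. For example, <li class="brewery_type"> may be absent from some divs. When that happens, the columns end up with different lengths and the code below cannot populate the tibble. How can I still parse the page and fill in NA for the li elements that are missing from a div?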

library(rvest)
library(dplyr)

html_file <- '<!DOCTYPE html>
<html>

<head>
    <title>Page Title</title>
</head>

<body>
    <div class="brewery" id="brewery">
        <ul class="vcard simple">
            <li class="name"> Bradley Farm / RB Brew, LLC</li>
            <li class="address">317 Springtown Rd </li>
            <li class="address_2">New Paltz, NY 12561-3020 | <a href="http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States" target="_blank">Map</a> </li>
            <li class="telephone">Phone: (845) 255-8769</li>
            <li class="brewery_type">Type: Micro</li>
            <li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
        </ul>
        <ul class="vcard simple col2"></ul>
    </div>
    <div class="brewery">
        <ul class="vcard simple">
            <li class="name">(405) Brewing Co</li>
            <li class="address">1716 Topeka St </li>
            <li class="address_2">Norman, OK 73069-8224 | <a href="http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States" target="_blank">Map</a> </li>
            <li class="telephone">Phone: (405) 816-0490</li>
            <li class="brewery_type">Type: Micro</li>
            <li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
        </ul>
        <ul class="vcard simple col2"></ul>
    </div>
</body>'

page <- read_html(html_file) 

tibble(
  name = page %>% html_nodes(".vcard .name") %>% html_text(),
  address = page %>% html_nodes(".vcard .address") %>% html_text(),
  type = page %>% html_nodes(".vcard .brewery_type") %>% html_text() %>% stringr::str_replace_all("^Type: ", ""),
  website = page %>% html_nodes(".vcard .url a") %>% html_attr("href")
)

1 Answer:

Answer 0 (score: 1)

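Instead of parsing all of the tags in one pass, I parse div.brewery into a list of nodes and then extract the requested information from each brewery separately. This is less efficient, but it keeps each piece of information tied to its parent. The key is to call html_node() (singular) on the list of breweries: it returns exactly one result per parent, and html_text() yields NA wherever the child is missing. This approach assumes each parent has at most one child of each kind, that is, one name, one address and one website per div.brewery. In the sample below, the brewery_type element has been removed from the second div to demonstrate the NA.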
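The difference between the two functions can be seen on a much smaller page. The following is a minimal sketch, assuming a hypothetical two-item document (the class names `item` and `type` are invented for illustration and are not from the question):

```r
library(rvest)

# Hypothetical two-item page: the second item has no <li class="type">.
doc <- read_html('
<div class="item"><ul><li class="name">A</li><li class="type">Micro</li></ul></div>
<div class="item"><ul><li class="name">B</li></ul></div>')

items <- html_nodes(doc, "div.item")

# html_nodes() flattens every match in the document, so the missing
# element simply shortens the result: one value for two parents.
flat <- html_text(html_nodes(doc, ".item .type"))

# html_node() returns exactly one entry per parent; a parent without
# a match contributes a missing node, which html_text() turns into NA.
per_item <- html_text(html_node(items, ".type"))

length(flat)  # 1
per_item      # "Micro" NA
```

Because `per_item` always has the same length as `items`, columns built this way can be combined into a tibble without a length mismatch.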

library(rvest)
library(dplyr)  # provides tibble()

html_file <- '<!DOCTYPE html>
<html>  
<head>
<title>Page Title</title>
</head>

<body>
<div class="brewery" id="brewery">
<ul class="vcard simple">
<li class="name"> Bradley Farm / RB Brew, LLC</li>
<li class="address">317 Springtown Rd </li>
<li class="address_2">New Paltz, NY 12561-3020 | <a href="http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States" target="_blank">Map</a> </li>
<li class="telephone">Phone: (845) 255-8769</li>
<li class="brewery_type">Type: Micro</li>
<li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
</ul>
<ul class="vcard simple col2"></ul>
</div>
<div class="brewery">
<ul class="vcard simple">
<li class="name">(405) Brewing Co</li>
<li class="address">1716 Topeka St </li>
<li class="address_2">Norman, OK 73069-8224 | <a href="http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States" target="_blank">Map</a> </li>
<li class="telephone">Phone: (405) 816-0490</li>

<li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
</ul>
<ul class="vcard simple col2"></ul>
</div>
</body>'

page <- read_html(html_file) 

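# One node per brewery; each is queried on its own below.
breweries <- page %>% html_nodes("div.brewery")

# html_node() (singular) returns one result per brewery, NA where the element is missing.
name <- breweries %>% html_node(".vcard .name") %>% html_text()
address <- breweries %>% html_node(".vcard .address") %>% html_text()
type <- breweries %>% html_node(".vcard .brewery_type") %>% html_text()
type <- gsub("^Type: ", "", type)
# Take the href attribute, as in the question, rather than the link text.
website <- breweries %>% html_node(".vcard .url a") %>% html_attr("href")

tibble(name, address, type, website)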
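If the one-child-per-parent assumption does not hold (say, a brewery lists two phone numbers), html_node() would silently keep only the first match. One possible workaround, sketched here with a hypothetical helper `text_or_na()` that is not part of rvest and a made-up two-brewery page, is to collect all matches under each parent and collapse them into one string:

```r
library(rvest)
library(dplyr)

# Hypothetical helper (not part of rvest): text of all matches under one
# parent node, collapsed into a single string, or NA when nothing matches.
text_or_na <- function(node, css) {
  txt <- node %>% html_nodes(css) %>% html_text() %>% trimws()
  if (length(txt) == 0) NA_character_ else paste(txt, collapse = "; ")
}

# Made-up sample: the first brewery has two phone numbers, the second has none.
page <- read_html('
<div class="brewery"><ul class="vcard">
  <li class="name">A</li><li class="telephone">111</li><li class="telephone">222</li>
</ul></div>
<div class="brewery"><ul class="vcard"><li class="name">B</li></ul></div>')

breweries <- html_nodes(page, "div.brewery")

# vapply() visits each brewery node in turn and guarantees one string per parent.
result <- tibble(
  name      = vapply(breweries, text_or_na, character(1), css = ".name"),
  telephone = vapply(breweries, text_or_na, character(1), css = ".telephone")
)
```

This trades some speed for robustness, since every parent is queried individually; for about 7,000 divs that should still be manageable, but it has not been benchmarked here.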