rvest & parsing HTML: find a list of items and extract specific information from each item

Asked: 2017-03-15 19:43:24

Tags: html r dplyr rvest

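I'm having a hard time finding clean code to do the following:

  • Find a list of five items
  • Iterate over all five items
  • Extract 4 columns from each item
  • Return a data frame with five rows, one row per item.

Sample HTML: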

<div class="i-am-a-list">
    <div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
    <div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
    <div class="item item-three"><a href=""></a><a class="title-three"></a><p>sub-title</p></div>
    <div class="item item-four"><a href=""></a><a class="title-for"></a><p>sub-title</p></div>
    <div class="item item-five"><a href=""></a><a class="title-five"></a><p>sub-title</p></div>

</div>

Code so far:

# find the upper list   
coll <- read_html(doc.html) %>%
  html_node('.i-am-a-list') %>%
  html_nodes(".item")

# problems here, how do I iterate over the returned divs 
# I was expecting something like
results <- coll %>% 
    do(parse_a_single_item) %>%
    rbind_all()

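Is it possible to write code this neat for such a common task? :)

1 Answer:

Answer 0: (score: 2)

It's not very pretty, and I feel like I'm missing some obvious approach, but you can do: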

library(rvest)
library(purrr)  # map_df() also needs dplyr installed for the row-binding

read_html(x) %>%
  html_node('.i-am-a-list') %>%
  html_nodes(".item") %>%
  map_df(~{
    # .x is a single .item node; pull one value per output column
    class <- html_attr(.x, 'class')
    a1 <- html_nodes(.x, 'a') %>% `[`(1) %>% html_attr('href')
    a2 <- html_nodes(.x, 'a') %>% `[`(2) %>% html_attr('class')
    # or with CSS selector
    # a1 <- html_nodes(.x, 'a:first-child') %>% html_attr('href')
    # a2 <- html_nodes(.x, 'a:nth-child(2)') %>% html_attr('class')
    p <- html_nodes(.x, 'p') %>% html_text()
    # keep strings as character so rows bind cleanly on R < 4.0
    data.frame(class, a1, a2, p, stringsAsFactors = FALSE)
  })

#             class a1          a2         p
# 1   item item-one          title sub-title
# 2   item item-two      title-two sub-title
# 3 item item-three    title-three sub-title
# 4  item item-four      title-for sub-title
# 5  item item-five     title-five sub-title
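
As a possible alternative to mapping over the items, the columns can also be built in a vectorized way, because html_node() applied to a node set returns exactly one result per input node. The following is only a sketch under that assumption, not part of the original answer: the names `items` and `result` are illustrative, and the XPath expressions `./a[1]` and `./a[2]` assume the two links are direct children of each item, as in the sample HTML.

```r
library(rvest)

# Same sample markup as in the question.
x <- '<div class="i-am-a-list">
  <div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
  <div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
  <div class="item item-three"><a href=""></a><a class="title-three"></a><p>sub-title</p></div>
  <div class="item item-four"><a href=""></a><a class="title-for"></a><p>sub-title</p></div>
  <div class="item item-five"><a href=""></a><a class="title-five"></a><p>sub-title</p></div>
</div>'

# All item nodes inside the list container.
items <- read_html(x) %>% html_nodes(".i-am-a-list .item")

# html_node() on a node set yields one node per item,
# so every column below has one entry per item.
result <- data.frame(
  class = html_attr(items, "class"),
  a1 = items %>% html_node(xpath = "./a[1]") %>% html_attr("href"),
  a2 = items %>% html_node(xpath = "./a[2]") %>% html_attr("class"),
  p = items %>% html_node("p") %>% html_text(),
  stringsAsFactors = FALSE
)

print(result)
```

This avoids building one small data frame per item, at the cost of relying on every item having the same internal structure.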

Data:

x <- '<div class="i-am-a-list">
  <div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
  <div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
  <div class="item item-three"><a href=""></a><a class="title-three"></a><p>sub-title</p></div>
  <div class="item item-four"><a href=""></a><a class="title-for"></a><p>sub-title</p></div>
  <div class="item item-five"><a href=""></a><a class="title-five"></a><p>sub-title</p></div>
</div>'
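
For readers who prefer the shape sketched in the question (a named per-item parser applied to each div), something like the following may work using base R only. It is an illustration, not the answerer's code: `parse_a_single_item` borrows the placeholder name from the question, `html` is an abbreviated two-item copy of the data above, and `lapply()` with `do.call(rbind, ...)` stands in for the `do()` / `rbind_all()` calls the asker imagined.

```r
library(rvest)

# Hypothetical helper: turns one .item node into a one-row data frame.
parse_a_single_item <- function(item) {
  links <- html_nodes(item, "a")
  data.frame(
    class = html_attr(item, "class"),
    a1 = html_attr(links[1], "href"),
    a2 = html_attr(links[2], "class"),
    p = html_text(html_node(item, "p")),
    stringsAsFactors = FALSE
  )
}

# Abbreviated copy of the sample data (two items only).
html <- '<div class="i-am-a-list">
  <div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
  <div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
</div>'

items <- html_nodes(read_html(html), ".i-am-a-list .item")

# Parse each item separately, then stack the one-row data frames.
results <- do.call(rbind, lapply(items, parse_a_single_item))

print(results)
```

Keeping the per-item logic in its own function makes it easier to test against a single div before running it over the whole list.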