Question

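I'm having trouble finding clean code to do the following:

Find a list of five items
Iterate over all five items
Extract 4 columns from each item
Return a data frame with five rows, one row per item.

Sample HTML: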

<div class="i-am-a-list">
    <div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
    <div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
    <div class="item item-three"><a href=""></a><a class="title-three"></a><p>sub-title</p></div>
    <div class="item item-four"><a href=""></a><a class="title-for"></a><p>sub-title</p></div>
    <div class="item item-five"><a href=""></a><a class="title-five"></a><p>sub-title</p></div>

</div>

Code so far:

# find the upper list   
coll <- read_html(doc.html) %>%
  html_node('.i-am-a-list') %>%
  html_nodes(".item")

# problems here, how do I iterate over the returned divs 
# I was expecting something like
results <- coll %>% 
    do(parse_a_single_item) %>%
    rbind_all()

Is it possible to write code this neat for such a common task? :)

Answer 1

It's not very pretty, and I feel like I'm missing some obvious approach, but you can do this:

library(rvest)
library(purrr)

read_html(x) %>%
  html_node('.i-am-a-list') %>%
  html_nodes(".item") %>% 
  map_df(~{
    class = html_attr(.x, 'class')
    a1 = html_nodes(.x, 'a') %>% '['(1) %>% html_attr('href')
    a2 = html_nodes(.x, 'a') %>% '['(2) %>% html_attr('class')
    # or with CSS selector
    # a1 = html_nodes(.x, 'a:first-child') %>% html_attr('href')
    # a2 = html_nodes(.x, 'a:nth-child(2)') %>% html_attr('class')
    p = html_nodes(.x, 'p') %>% html_text()
    data.frame(class, a1, a2, p)
    })

#             class a1          a2         p
# 1   item item-one          title sub-title
# 2   item item-two      title-two sub-title
# 3 item item-three    title-three sub-title
# 4  item item-four      title-for sub-title
# 5  item item-five     title-five sub-title

Data:

x <- '<div class="i-am-a-list">
  <div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
  <div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
  <div class="item item-three"><a href=""></a><a class="title-three"></a><p>sub-title</p></div>
  <div class="item item-four"><a href=""></a><a class="title-for"></a><p>sub-title</p></div>
  <div class="item item-five"><a href=""></a><a class="title-five"></a><p>sub-title</p></div>
</div>'
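
A possible alternative, offered only as a sketch: rvest's single-match selector can be applied to the whole set of items at once, returning exactly one result per item, so the four columns can be built without an explicit loop. This assumes rvest 1.0.0 or later, where `html_elements()` and `html_element()` are the current names for `html_nodes()` and `html_node()`; the variable names `items` and `results` are illustrative, not from the original post.

```r
library(rvest)

# Same sample data as above, repeated so the snippet runs on its own
x <- '<div class="i-am-a-list">
  <div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
  <div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
  <div class="item item-three"><a href=""></a><a class="title-three"></a><p>sub-title</p></div>
  <div class="item item-four"><a href=""></a><a class="title-for"></a><p>sub-title</p></div>
  <div class="item item-five"><a href=""></a><a class="title-five"></a><p>sub-title</p></div>
</div>'

# One node per item in the list
items <- read_html(x) %>%
  html_elements(".i-am-a-list .item")

# html_element() (singular) yields one match per item (NA if absent),
# so every column keeps the same length and order as `items`
results <- data.frame(
  class = html_attr(items, "class"),
  a1    = items %>% html_element("a:nth-child(1)") %>% html_attr("href"),
  a2    = items %>% html_element("a:nth-child(2)") %>% html_attr("class"),
  p     = items %>% html_element("p") %>% html_text(),
  stringsAsFactors = FALSE
)

print(results)
```

Because a missing element yields `NA` rather than being dropped, this form should stay aligned even when some items lack a link or a paragraph, which is the main reason to prefer it over collecting each column separately with `html_elements()`.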

rvest & parsing HTML: find a list of items and extract specific information from each item
