I'm having trouble finding clean code to do the following:
Example HTML:
<div class="i-am-a-list">
<div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
<div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
<div class="item item-three"><a href=""></a><a class="title-three"></a><p>sub-title</p></div>
<div class="item item-four"><a href=""></a><a class="title-for"></a><p>sub-title</p></div>
<div class="item item-five"><a href=""></a><a class="title-five"></a><p>sub-title</p></div>
</div>
Code so far:
# find the upper list
coll <- read_html(doc.html) %>%
  html_node('.i-am-a-list') %>%
  html_nodes(".item")
# problems here, how do I iterate over the returned divs
# I was expecting something like
results <- coll %>%
  do(parse_a_single_item) %>%
  rbind_all()
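Is it possible to write nice code like this for such a common task? :)
Answer 0 (score: 2)
It isn't very pretty, and I feel like I'm missing some obvious approach, but you could do: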
library(rvest)
library(purrr)
read_html(x) %>%
  html_node('.i-am-a-list') %>%
  html_nodes(".item") %>%
  # build a one-row data frame per item; map_df() row-binds the results
  map_df(~{
    class = html_attr(.x, 'class')
    # first and second <a> inside the current item
    a1 = html_nodes(.x, 'a') %>% `[`(1) %>% html_attr('href')
    a2 = html_nodes(.x, 'a') %>% `[`(2) %>% html_attr('class')
    # or with CSS selector
    # a1 = html_nodes(.x, 'a:first-child') %>% html_attr('href')
    # a2 = html_nodes(.x, 'a:nth-child(2)') %>% html_attr('class')
    p = html_nodes(.x, 'p') %>% html_text()
    data.frame(class, a1, a2, p)
  })
#             class a1          a2         p
# 1   item item-one          title sub-title
# 2   item item-two      title-two sub-title
# 3 item item-three    title-three sub-title
# 4  item item-four      title-for sub-title
# 5  item item-five     title-five sub-title
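The shape the question hoped for, one function per item followed by a row-bind, can also be sketched with rvest and base R. Here `parse_a_single_item` is a hypothetical helper named after the placeholder in the question, and `sample_html` is a shortened copy of the example markup; `lapply()` and `do.call(rbind, ...)` are stand-ins for the `do()` and `rbind_all()` calls the question imagined, not something the original code used.

```r
library(rvest)

# Hypothetical helper: turn one .item node into a one-row data frame.
parse_a_single_item <- function(item) {
  links <- html_nodes(item, "a")  # all <a> elements inside this item
  data.frame(
    class = html_attr(item, "class"),
    a1 = html_attr(links[1], "href"),
    a2 = html_attr(links[2], "class"),
    p = html_text(html_node(item, "p")),
    stringsAsFactors = FALSE
  )
}

# Shortened sample markup (an assumption, trimmed from the example HTML).
sample_html <- '<div class="i-am-a-list">
<div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
<div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
</div>'

items <- read_html(sample_html) %>%
  html_nodes(".i-am-a-list .item")

# Parse every item, then stack the one-row data frames.
results <- do.call(rbind, lapply(items, parse_a_single_item))
print(results)
```

Keeping the per-item logic in a named function makes it easy to test on a single node before running it over the whole list.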
Data:
x <- '<div class="i-am-a-list">
<div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
<div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
<div class="item item-three"><a href=""></a><a class="title-three"></a><p>sub-title</p></div>
<div class="item item-four"><a href=""></a><a class="title-for"></a><p>sub-title</p></div>
<div class="item item-five"><a href=""></a><a class="title-five"></a><p>sub-title</p></div>
</div>'
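A possibly simpler route, assuming each column holds at most one value per item, skips the explicit loop. When `html_node()` is given a node set it returns one result per input node, so each column can be extracted in a single call. The sketch below uses XPath expressions (`./a[1]`, `./a[2]`) evaluated relative to each item; `sample_html` is a made-up two-item variant of the data whose second item has no `<p>`, to show that the rows stay aligned.

```r
library(rvest)

# Made-up variant of the data: the second item deliberately lacks a <p>.
sample_html <- '<div class="i-am-a-list">
<div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
<div class="item item-two"><a href=""></a><a class="title-two"></a></div>
</div>'

items <- read_html(sample_html) %>%
  html_nodes(".i-am-a-list .item")

# html_node() on a node set yields exactly one entry per item,
# so every column has the same length as `items`.
result <- data.frame(
  class = html_attr(items, "class"),
  a1 = items %>% html_node(xpath = "./a[1]") %>% html_attr("href"),
  a2 = items %>% html_node(xpath = "./a[2]") %>% html_attr("class"),
  p = items %>% html_node("p") %>% html_text(),
  stringsAsFactors = FALSE
)
print(result)
```

An item without a matching child gets `NA` in that column instead of shifting later values up a row, which is the main reason to prefer `html_node()` over `html_nodes()` here.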