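I have a description list that I downloaded from a website with an agenda, and I am trying to build a data.frame from it without success. The description list has the following structure: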
<dl>
<dt> (which contains a <p class="day"> for the day)
<dd> (which contains a <p class="hour"> for the hour and a <p class="event"> for the event)
I managed to extract this data with the following code:
library(rvest)
url <- read_html("www.mypage.com")
day <- data.frame(day = html_text(html_nodes(url, '.day')))
hour <- data.frame(hour = html_text(html_nodes(url, '.hour')))
event <- data.frame(event = html_text(html_nodes(url, '.event')))
day$ID <- seq.int(nrow(day))
hour$ID <- seq.int(nrow(hour))
event$ID <- seq.int(nrow(event))
Then I created a data frame by joining the three on ID.
The problem is that what I actually have is this:
<dl>
<dt>
<dd>
<dd>
<dd>
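That is, there is more than one event per day, so the day, hour, and event IDs no longer line up.
How can I create my data.frame, given that the same <dt> may be followed by several <dd> elements? Thanks!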
Answer 0 (score: 2)
dl/dt/dd scraping is one of those "why do HTML creators do this to us" things. This should get you what you want:
library(rvest)
library(xml2) # for xml_name(); recent rvest versions do not attach xml2
library(tidyverse)
pg <- read_html("http://www.presidencia.pt/?idc=11&fano=2016")
# grab ALL the dt/dd elements under each dl into one big node list
entries <- html_nodes(pg, xpath=".//dl[@id='ms_agend3']/*")
# this finds all of the "dt" elements
starts <- which(xml_name(entries) == "dt")
# this tells us where ^^ "dd"'s stop
ends <- c(starts[-1]-1, length(entries))
# it took 30s for me, so progress bars make the time pass visually
pb <- progress_estimated(length(starts))
# now we iterate over the start/end pairs
map2_df(starts, ends, ~{
  pb$tick()$print() # tick off the progress bar
  # we're only going to work on the part of the node list for this dt/dd set
  start <- .x
  end <- .y
  # get the day
  dt <- html_text(entries[start], trim=TRUE)
  # now iterate over each associated dd and pull out the info
  map_df((start+1):end, ~{
    tibble( # tibble() replaces the deprecated data_frame()
      hour = html_text(html_node(entries[.x], "div.hora"), trim=TRUE),
      text = html_text(html_node(entries[.x], "div.texto"), trim=TRUE)
    )
  }) %>%
    mutate(day = dt) # add the day in
}) %>%
  select(day, hour, text) -> agenda # rearrange and store
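It's a bit slow because of the way it builds the data frames, but it captures the day/hour/text of the agenda (including blank hours, which I assume are informational or all-day events).
This: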
pb <- progress_estimated(length(starts))
map2_df(starts, ends, ~{
  pb$tick()$print()
  start <- .x
  end <- .y
  # one row per dd in this group; the single day value is recycled across them
  tibble( # tibble() replaces the deprecated data_frame()
    hour = html_text(html_nodes(entries[(start+1):end], "div.hora"), trim=TRUE),
    text = html_text(html_nodes(entries[(start+1):end], "div.texto"), trim=TRUE),
    day = html_text(entries[start], trim=TRUE)
  )
}) %>%
  select(day, hour, text) -> agenda
is a bit faster and, as far as my eyes can tell, produces the same result.
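If you want to try the dt/dd grouping idea without hitting the site, here is a minimal, self-contained sketch on an inline HTML snippet. The snippet and its `day`/`hour`/`event` class names are made up to mirror the structure described in the question (they are not the real page's markup), and it swaps the start/end loop for a `cumsum()` over the tag names, which is just one possible variation on the approach above. It assumes every `<dd>` comes after some `<dt>`.

```r
library(rvest)
library(xml2) # for xml_name()

# Hypothetical markup mirroring the question's structure (not the real page)
html <- '<dl>
  <dt><p class="day">Monday</p></dt>
  <dd><p class="hour">09:00</p><p class="event">Breakfast</p></dd>
  <dd><p class="hour">11:00</p><p class="event">Meeting</p></dd>
  <dt><p class="day">Tuesday</p></dt>
  <dd><p class="hour">14:00</p><p class="event">Visit</p></dd>
</dl>'

pg <- read_html(html)

# all direct children of the dl, in document order
entries <- html_nodes(pg, xpath = "//dl/*")

# TRUE for each dt; the running count says which dt every element belongs to
is_dt <- xml_name(entries) == "dt"
day_index <- cumsum(is_dt)

days <- html_text(entries[is_dt], trim = TRUE)
dd <- entries[!is_dt]

# html_node() returns one result per dd, so hour/event stay aligned
# even when a dd lacks one of them (the text is then NA)
agenda <- data.frame(
  day = days[day_index[!is_dt]],
  hour = html_text(html_node(dd, ".hour"), trim = TRUE),
  event = html_text(html_node(dd, ".event"), trim = TRUE),
  stringsAsFactors = FALSE
)

print(agenda)
```

Because each `<dd>` is labelled with the most recent `<dt>` before it, the number of events per day no longer matters, which is exactly what breaks the join-by-ID approach in the question.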