I have a fairly complex task: I need to look up a series of URLs contained in a data frame, scrape some data from each URL, and then add that data back to the original data frame. I seem to have solved the hardest part (the scraping), but I'm stuck on how to automate the task (I suspect this part may be simple).
Here is the situation: I have a data.frame of 12 variables and 44,000 rows. One of these variables, Programme_Synopsis_url, contains URLs of programmes on BBC iPlayer. I need to go to each URL, extract one piece of data (the details of the channel) and add it to a new column named Channel.
Here is some sample data (apologies for the size/complexity of this example, but I think sharing it is necessary to get the right solution):
df <- structure(list(Title = structure(c(3L, 7L, 5L, 2L, 6L, 6L, 1L,
4L, 9L, 8L), .Label = c("Asian Provocateur", "Cuckoo", "Dragons' Den",
"In The Flesh", "Keeping Faith", "Lost Boys? What's Going Wrong For Asian Men",
"One Hot Summer", "Travels in Trumpland with Ed Balls", "Two Pints of Lager and a Packet of Crisps"
), class = "factor"), Series = structure(c(1L, 1L, 1L, 3L, 1L,
1L, 2L, 2L, 1L, 1L), .Label = c("", "Series 1-2", "Series 4"), class = "factor"),
Programme_Synopsis = structure(c(2L, 5L, 4L, 6L, 1L, 1L,
8L, 7L, 9L, 3L), .Label = c("", "1. The Dragons are back - with big money on the table.",
"1/3 Proud. Meeting rednecks", "1/8 Faith questions everything when her husband goes missing",
"4/6 What Happens in Ibiza... Is Megan really a party animal?",
"Box Set. Dale plans to propose – but what does Ken think?",
"Box Set. For the undead... life begins again", "Box Set. Romesh... and mum",
"Series 1-9. Box Set"), class = "factor"), Programme_Synopsis_url = structure(c(6L,
9L, 4L, 8L, 1L, 1L, 3L, 7L, 2L, 5L), .Label = c("", "https://www.bbc.co.uk/iplayer/episode/b00747zt/two-pints-of-lager-and-a-packet-of-crisps-series-1-1-fags-shags-and-kebabs",
"https://www.bbc.co.uk/iplayer/episode/b06fq3x4/asian-provocateur-series-1-1-uncle-thiru",
"https://www.bbc.co.uk/iplayer/episode/b09rjsq5/keeping-faith-series-1-episode-1",
"https://www.bbc.co.uk/iplayer/episode/b0bdpvhf/travels-in-trumpland-with-ed-balls-series-1-1-proud",
"https://www.bbc.co.uk/iplayer/episode/b0bfq7y2/dragons-den-series-16-episode-1",
"https://www.bbc.co.uk/iplayer/episode/p00szzcp/in-the-flesh-series-1-episode-1",
"https://www.bbc.co.uk/iplayer/episode/p06f52g1/cuckoo-series-4-1-lawyer-of-the-year",
"https://www.bbc.co.uk/iplayer/episode/p06fvww2/one-hot-summer-series-1-4-what-happens-in-ibiza"
), class = "factor"), Programme_Duration = structure(c(6L,
4L, 6L, 1L, 6L, 6L, 2L, 5L, 3L, 6L), .Label = c("25 mins",
"28 mins", "29 mins", "40 mins", "56 mins", "59 mins"), class = "factor"),
Programme_Availability = structure(c(4L, 2L, 1L, 6L, 4L,
4L, 5L, 6L, 5L, 3L), .Label = c("Available for 1 month",
"Available for 11 months", "Available for 17 days", "Available for 28 days",
"Available for 3 months", "Available for 5 months"), class = "factor"),
Programme_Category = structure(c(2L, 2L, 2L, 2L, 2L, 3L,
1L, 1L, 1L, 1L), .Label = c("Box Sets", "Featured", "Most Popular"
), class = "factor"), Programme_Genre = structure(c(4L, 2L,
3L, 5L, 2L, 2L, 1L, 3L, 1L, 2L), .Label = c("Comedy", "Documentary",
"Drama", "Entertainment", "New SeriesComedy"), class = "factor"),
date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = "13/08/2018", class = "factor"), rank = c(1L,
2L, 3L, 4L, 5L, 12L, 1L, 2L, 3L, 4L), row = c(1L, 1L, 1L,
1L, 1L, 3L, 4L, 4L, 4L, 4L), Box_Set = structure(c(1L, 1L,
1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
Just to complicate matters further(!), there are two different types of URL. Some point to a programme's episode page, and some point to the main programme page (there is no difference in URL syntax by which to tell the two apart). This matters because the data to be scraped (the channel name) sits in a different location depending on whether it is an episode page or a programme's main page. I have written a script to get the data for each of these page types:
### Get Channel for programme page ###
library(rvest)

### First, set URL ###
url <- 'https://www.bbc.co.uk/iplayer/episode/b0bfq7y2/dragons-den-series-16-episode-1'

### Then, locate details of Channel via xpath ###
channel <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>%
  html_text()

### Confirm Channel details ###
print(channel)
### Get Channel for episode page ###
### First, set URL ###
url <- 'https://www.bbc.co.uk/iplayer/episode/p06fvww2/one-hot-summer-series-1-4-what-happens-in-ibiza'

### Then, locate details of Channel via xpath ###
channel <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>%
  html_text()

### Confirm Channel details ###
print(channel)
The question is: how do I automate this, looping through each URL (roughly 44,000 of them), extracting this data, and adding it to a new column named Channel?
A few final concerns/caveats/questions:

- There are many duplicate URLs in the data frame. One option would be to create a stripped-down data frame containing only unique URLs (deduplicated by the Programme_Synopsis_url or Title column). Doing so would mean I would need to scrape far fewer URLs, and I could then merge that data back into the original data frame, i.e. where Title matches, add the variable from the Channel column of the reduced data frame to a column named Channel in the original data frame.
- As noted above, the channel details sit at different xpaths depending on the page type. Is there a way to write the script so that IF one xpath returns data it is entered into the Channel column for that row, ELSE the data is copied from the other xpath and entered into the Channel column for that row? If the page contains neither xpath (which is possible), do nothing.

Hope all this is clear. Happy to elaborate where necessary.
Edit: updated one of the incorrect URLs in the code above.
Answer 0 (score: 1)
You can achieve this rather easily as follows:

1. Write a function that gets the channel for a given URL.
2. Loop through all the URLs (I use purrr::map, but any loop would do).

library(rvest)
get_channel <- function(url) {
  ## some elements do not contain any url
  if (!nchar(url)) return(NA_character_)
  page <- url %>%
    read_html()
  ## try to read the channel via the programme-page xpath
  channel <- page %>%
    html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>%
    html_text()
  ## if it's empty we are most likely on an episode page -> try the other xpath
  if (!length(channel)) {
    channel <- page %>%
      html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>%
      html_text()
  }
  ifelse(length(channel), channel, NA_character_)
}
## loop through all urls in the df
purrr::map_chr(as.character(df$Programme_Synopsis_url), get_channel)
# [1] "BBC Two" "BBC Three" "BBC Three" "BBC Three" NA NA "BBC Three" "BBC Three" "BBC Three" "BBC Two"
Regarding your other questions: it may make sense to pause after every n requests so that the website does not block you. There are several ways websites try to protect themselves against web scraping, and what you need to do depends on the specific case. That said, I don't think 44k requests would even hurt their service, but I'm no expert here. Avoiding requests for duplicated URLs definitely makes sense, and this can easily be achieved by [untested]:
new_df <- df[!duplicated(df$Programme_Synopsis_url), ]
new_df$channel <- purrr::map_chr(as.character(new_df$Programme_Synopsis_url),
                                 get_channel)
dplyr::left_join(df,
                 new_df[, c("Programme_Synopsis_url", "channel")],
                 by = "Programme_Synopsis_url")