I have a fairly complex task: I need to look up a series of URLs contained in a data frame, scrape some data from each URL, and then add that data back to the original data frame. I seem to have solved the hardest part (the scraping), but I'm stuck on how to automate the task (I suspect this part may be simple).
Here is the situation: I have a data.frame of 12 variables and 44,000 rows. One of these variables, Programme_Synopsis_url, contains URLs of programmes on BBC iPlayer. I need to go to each URL, extract one piece of data (the details of the channel) and add it to a new column named Channel.
Here is some sample data (apologies for the size/complexity of this example, but I think sharing it is necessary to get the right solution):
df <- structure(list(Title = structure(c(3L, 7L, 5L, 2L, 6L, 6L, 1L,
4L, 9L, 8L), .Label = c("Asian Provocateur", "Cuckoo", "Dragons' Den",
"In The Flesh", "Keeping Faith", "Lost Boys? What's Going Wrong For Asian Men",
"One Hot Summer", "Travels in Trumpland with Ed Balls", "Two Pints of Lager and a Packet of Crisps"
), class = "factor"), Series = structure(c(1L, 1L, 1L, 3L, 1L,
1L, 2L, 2L, 1L, 1L), .Label = c("", "Series 1-2", "Series 4"), class = "factor"),
Programme_Synopsis = structure(c(2L, 5L, 4L, 6L, 1L, 1L,
8L, 7L, 9L, 3L), .Label = c("", "1. The Dragons are back - with big money on the table.",
"1/3 Proud. Meeting rednecks", "1/8 Faith questions everything when her husband goes missing",
"4/6 What Happens in Ibiza... Is Megan really a party animal?",
"Box Set. Dale plans to propose – but what does Ken think?",
"Box Set. For the undead... life begins again", "Box Set. Romesh... and mum",
"Series 1-9. Box Set"), class = "factor"), Programme_Synopsis_url = structure(c(6L,
9L, 4L, 8L, 1L, 1L, 3L, 7L, 2L, 5L), .Label = c("", "https://www.bbc.co.uk/iplayer/episode/b00747zt/two-pints-of-lager-and-a-packet-of-crisps-series-1-1-fags-shags-and-kebabs",
"https://www.bbc.co.uk/iplayer/episode/b06fq3x4/asian-provocateur-series-1-1-uncle-thiru",
"https://www.bbc.co.uk/iplayer/episode/b09rjsq5/keeping-faith-series-1-episode-1",
"https://www.bbc.co.uk/iplayer/episode/b0bdpvhf/travels-in-trumpland-with-ed-balls-series-1-1-proud",
"https://www.bbc.co.uk/iplayer/episode/b0bfq7y2/dragons-den-series-16-episode-1",
"https://www.bbc.co.uk/iplayer/episode/p00szzcp/in-the-flesh-series-1-episode-1",
"https://www.bbc.co.uk/iplayer/episode/p06f52g1/cuckoo-series-4-1-lawyer-of-the-year",
"https://www.bbc.co.uk/iplayer/episode/p06fvww2/one-hot-summer-series-1-4-what-happens-in-ibiza"
), class = "factor"), Programme_Duration = structure(c(6L,
4L, 6L, 1L, 6L, 6L, 2L, 5L, 3L, 6L), .Label = c("25 mins",
"28 mins", "29 mins", "40 mins", "56 mins", "59 mins"), class = "factor"),
Programme_Availability = structure(c(4L, 2L, 1L, 6L, 4L,
4L, 5L, 6L, 5L, 3L), .Label = c("Available for 1 month",
"Available for 11 months", "Available for 17 days", "Available for 28 days",
"Available for 3 months", "Available for 5 months"), class = "factor"),
Programme_Category = structure(c(2L, 2L, 2L, 2L, 2L, 3L,
1L, 1L, 1L, 1L), .Label = c("Box Sets", "Featured", "Most Popular"
), class = "factor"), Programme_Genre = structure(c(4L, 2L,
3L, 5L, 2L, 2L, 1L, 3L, 1L, 2L), .Label = c("Comedy", "Documentary",
"Drama", "Entertainment", "New SeriesComedy"), class = "factor"),
date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = "13/08/2018", class = "factor"), rank = c(1L,
2L, 3L, 4L, 5L, 12L, 1L, 2L, 3L, 4L), row = c(1L, 1L, 1L,
1L, 1L, 3L, 4L, 4L, 4L, 4L), Box_Set = structure(c(1L, 1L,
1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
Just to complicate matters further(!), there are two different types of URL. Some point to a programme's episode page, and some point to the main programme page (there is no difference in URL syntax by which to tell the two apart). This matters because the data to be scraped (the channel name) sits in a different location depending on whether it is an episode page or a programme's main page. I have written a script to get the data for each of these page types:
### Get Channel for programme page ###
library(rvest)

### First, set URL ###
url <- 'https://www.bbc.co.uk/iplayer/episode/b0bfq7y2/dragons-den-series-16-episode-1'

### Then, locate details of Channel via xpath ###
channel <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>%
  html_text()

### Confirm Channel details ###
print(channel)
### Get Channel for episode page ###
### First, set URL ###
url <- 'https://www.bbc.co.uk/iplayer/episode/p06fvww2/one-hot-summer-series-1-4-what-happens-in-ibiza'

### Then, locate details of Channel via xpath ###
channel <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>%
  html_text()

### Confirm Channel details ###
print(channel)
The question is: how do I automate this, looping through each URL (roughly 44,000 of them), extracting this data, and adding it to a new column named Channel?
A few final concerns/caveats/questions:

- There are many duplicate URLs in the data frame. One option would be to create a stripped-down data frame containing only unique URLs (deduplicated by the Programme_Synopsis_url or Title column). Doing so would mean I would need to scrape far fewer URLs, and I could then merge that data back into the original data frame, i.e. where Title matches, add the variable from the Channel column of the reduced data frame to a column named Channel in the original data frame.
- As noted above, the channel details sit at different xpaths depending on the page type. Is there a way to write the script so that IF one xpath returns data it is entered into the Channel column for that row, ELSE the data is copied from the other xpath and entered into the Channel column for that row? If the page contains neither xpath (which is possible), do nothing.

Hope all this is clear. Happy to elaborate where necessary.
Edit: updated one of the incorrect URLs in the code above.
Answer 0 (score: 1)
You can achieve this rather easily as follows:

1. Write a function that gets the channel for a given URL.
2. Loop through all the URLs (I use purrr::map, but any loop would do).

library(rvest)
get_channel <- function(url) {
  ## some elements do not contain any url
  if (!nchar(url)) return(NA_character_)
  page <- url %>%
    read_html()
  ## try to read the channel via the programme-page xpath
  channel <- page %>%
    html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>%
    html_text()
  ## if it's empty we are most likely on an episode page -> try the other xpath
  if (!length(channel)) {
    channel <- page %>%
      html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>%
      html_text()
  }
  ifelse(length(channel), channel, NA_character_)
}
## loop through all urls in the df
purrr::map_chr(as.character(df$Programme_Synopsis_url), get_channel)
# [1] "BBC Two" "BBC Three" "BBC Three" "BBC Three" NA NA "BBC Three" "BBC Three" "BBC Three" "BBC Two"
Regarding your other questions: it may make sense to pause after every n requests so that the website does not block you. There are several ways websites try to protect themselves against web scraping, and what you need to do depends on the specific case. That said, I don't think 44k requests would even hurt their service, but I'm no expert here. Avoiding requests for duplicated URLs definitely makes sense, and this can easily be achieved by [untested]:
new_df <- df[!duplicated(df$Programme_Synopsis_url), ]
new_df$channel <- purrr::map_chr(as.character(new_df$Programme_Synopsis_url),
                                 get_channel)
dplyr::left_join(df,
                 new_df[, c("Programme_Synopsis_url", "channel")],
                 by = "Programme_Synopsis_url")