I am trying to work out the range of numbers at the end of links like this one: https://schedule.sxsw.com/2019/speakers/2008434
Each link ends in a number, e.g. 2008434. These links point to the bios of speakers at the upcoming SXSW (South by Southwest) festival. I know there are 3,729 speakers in total, but that does not help me work out how each speaker and their page is numbered.
I am trying to do some simple web scraping with the lapply function, but it will not work because I cannot pin down the range. For example, I used:
number_range <- seq(1:3000000)
Clicking around the links does not reveal any obvious numbering scheme, and I get a lot of Error in open.connection(x, "rb") : HTTP error 404.
Is there a simple way to get the range / to make this function work? My code is below:
library(rvest)
library(tidyverse)
# List for bios
sxsw_bios <- list()
# Creating vector of numbers
number_range <- seq(1:3000000)
# Scraping bios with names
sxsw_bios <- lapply(number_range, function(y) {
  # Getting speaker name
  Name <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/", y)) %>%
    html_nodes(".speaker-name") %>%
    html_text()
  Name
})
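Even without knowing the numbering scheme, a single 404 does not have to abort the whole run. Below is a minimal sketch (not part of the original code; the helper safe_read is made up for illustration and assumes the same URL pattern as above) that wraps read_html() in tryCatch() so a missing ID yields NULL instead of an error:

# Hypothetical helper: returns NULL instead of erroring on HTTP 404
safe_read <- function(url) {
  tryCatch(read_html(url), error = function(e) NULL)
}

sxsw_bios <- lapply(number_range, function(y) {
  doc <- safe_read(paste0("https://schedule.sxsw.com/2019/speakers/", y))
  if (is.null(doc)) return(NULL)  # skip IDs that do not resolve
  doc %>%
    html_nodes(".speaker-name") %>%
    html_text()
})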
Answer 0 (score: 2)
You can scrape the list of IDs from the speakers pages:
library(rvest)
# Loop over the alphabetical index pages (a-z) and collect the speaker IDs
ids <- lapply(letters, function(x) {
  speakers <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/alpha/", x)) %>%
    rvest::html_nodes(xpath = "//*[@class='favorite-click absolute']/@data-item-id")
  # Strip the attribute markup, keeping just the numeric ID
  speakers <- gsub(' data-item-id="|"', "", speakers)
  speakers
})
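As a rough sanity check (assuming the alphabetical index pages cover every speaker), the number of IDs collected should be close to the 3,729 speakers mentioned in the question:

length(unique(unlist(ids)))
# expected to be roughly 3729 if the index pages list all speakers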
Then use these IDs in your code (in this example I only run the first 5):
ids <- unlist(ids)
# Scraping bios with names
sxsw_bios <- lapply(ids[1:5], function(y) {
  doc <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/", y))
  # Getting speaker name
  Name <- doc %>%
    html_nodes(".speaker-name") %>%
    html_text()
  # Getting speaker bio
  bio <- doc %>%
    html_nodes(xpath = "//*[@class='row speaker-bio']") %>%
    html_text()
  list(name = Name, bio = bio)
})
sxsw_bios[[1]]
# $name
# [1] "A$AP Rocky"
#
# $bio
# [1] "A$AP Rocky is a cultural beacon that continues to ... <etc>
# ------------
sxsw_bios[[5]]
# $name
# [1] "Ken Abdo"
#
# $bio
# [1] "Ken Abdo is a partner at the national law firm of Fox Rothschild...<etc>