Finding the range of numbers at the end of URLs in R

Asked: 2019-02-11 01:38:20

Tags: r lapply

I'm trying to work out the range of the numbers at the end of links like this one: https://schedule.sxsw.com/2019/speakers/2008434

Each link ends in a number, e.g. 2008434, and points to the bio of a speaker at the upcoming SXSW (South by Southwest) festival. I know there are 3,729 speakers in total, but that doesn't help me figure out how the speakers and their pages are numbered.

I'm trying to do some simple web scraping with an lapply call, but since I can't pin down the range of IDs, my function doesn't work properly. For example, I used:

number_range <- seq(1:3000000)
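As an aside, seq(1:3000000) simply returns the vector 1:3000000 itself; if the numbering really were sequential, the more idiomatic ways to build that index would be:

# Equivalent, clearer ways to spell out the candidate IDs
number_range <- 1:3000000
number_range <- seq_len(3000000)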

Clicking around the site doesn't reveal any obvious numbering scheme.

I get a lot of errors like Error in open.connection(x, "rb") : HTTP error 404.
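Wrapping read_html in tryCatch would at least keep the loop running past the 404s; a minimal sketch of that idea, using the same URL pattern:

# Sketch: return NULL instead of stopping when a page does not exist
safe_read <- function(url) {
  tryCatch(read_html(url), error = function(e) NULL)
}

page <- safe_read("https://schedule.sxsw.com/2019/speakers/2008434")
if (!is.null(page)) {
  page %>% html_nodes(".speaker-name") %>% html_text()
}

That only suppresses the errors, though; it does nothing about requesting millions of non-existent pages.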

Is there a simple way to get that range / make the function work? My code is below:

library(rvest)
library(tidyverse)

# List for bios
sxsw_bios <- list()

# Creating vector of numbers
number_range <- seq(1:3000000)

# Scraping bios with names
sxsw_bios <- lapply(number_range, function(y) {

  # Getting speaker name
  Name <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/", y)) %>% 
    html_nodes(".speaker-name") %>% 
    html_text()

  Name
})

1 Answer:

Answer 0 (score: 2):

You can scrape the list of IDs from the speaker index pages:

library(rvest)

ids <- lapply(letters, function(x) {
  # Read the A-Z speaker index page and pull out the data-item-id attributes
  speakers <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/alpha/", x)) %>%
    rvest::html_nodes(xpath = "//*[@class='favorite-click absolute']/@data-item-id")

  # Strip the attribute wrapper, leaving just the numeric ID
  speakers <- gsub(' data-item-id="|"', "", speakers)
  speakers
})
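If you would rather avoid the gsub clean-up, roughly the same result can be had by selecting the links via their CSS class and reading the attribute with html_attr; a sketch, assuming the same favorite-click absolute elements carry the ID:

# Roughly equivalent: read the data-item-id attribute directly
ids <- lapply(letters, function(x) {
  read_html(paste0("https://schedule.sxsw.com/2019/speakers/alpha/", x)) %>%
    html_nodes(".favorite-click.absolute") %>%
    html_attr("data-item-id")
})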

Then use those IDs in your code (in this example I'm only doing the first 5):

ids <- unlist(ids)

# Scraping bios with names
sxsw_bios <- lapply(ids[1:5], function(y) {

  doc <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/", y))

  # Getting speaker name
  Name <- doc %>% 
    html_nodes(".speaker-name") %>% 
    html_text()

  # Getting speaker bio
  bio <- doc %>%
    html_nodes(xpath = "//*[@class='row speaker-bio']") %>%
    html_text()
  list(name= Name, bio = bio)
})

sxsw_bios[[1]]

# $name
# [1] "A$AP Rocky"
# 
# $bio
# [1] "A$AP Rocky is a cultural beacon that continues to ... <etc>

# ------------

sxsw_bios[[5]]

# $name
# [1] "Ken Abdo"
# 
# $bio
# [1] "Ken Abdo is a partner at the national law firm of Fox Rothschild...<etc>