Looping through layers of URLs with R and scraping data

Time: 2019-02-25 16:31:03

Tags: r web-scraping

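I am working on a school project that involves using R to scrape player attributes from https://www.baseball-reference.com and build a data frame from them. The site lists all players alphabetically by last name, and I have written the following code to create a URL for each letter: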

# every baseball player is identified by their last name, so using all the
# letters allows me to build urls with the letters
ltrs <- letters

# create an empty container for the urls
url_container <- c()

# this is the base url I append letters to
url = "https://www.baseball-reference.com/players/"

# use a for loop to create the urls
for(i in 1:length(ltrs)){
  url_start = paste(url, ltrs[i], "/", sep = '')
  url_container = c(url_container, url_start)
}

# print the container to make sure the urls are correctly constructed
url_container

# This outputs: [1] "https://www.baseball-reference.com/players/a/"
#   "https://www.baseball-reference.com/players/b/" etc.
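The same vector of URLs can also be built without a loop, because paste0() is vectorized over its arguments. A minimal sketch, assuming only base R; the name base_url is an illustrative choice and not part of the original code:

```r
# Base URL that the letters are appended to
base_url <- "https://www.baseball-reference.com/players/"

# paste0() recycles base_url and "/" across all 26 letters,
# so no loop or growing container is needed
url_container <- paste0(base_url, letters, "/")

# Inspect the first two URLs
head(url_container, 2)
```

This avoids growing url_container one element at a time inside the loop.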

Each page lists a certain number of players, which I can scrape with the code below; it outputs the number of players for each letter.

library(rvest)

player_quantity <- c()

for(i in 1:length(url_container)){
  raw = read_html(url_container[i])
  player_count <- raw %>%
    # this is where the player count lives
    html_nodes(xpath = "//*[@id='all_players_']/div[1]/h2") %>%
    # extract the heading text (it is converted to a number after the loop)
    html_text()
  player_quantity <- c(player_quantity, player_count)
}

player_quantity <- as.numeric(gsub("([0-9]+).*$", "\\1", player_quantity))
player_quantity

# Outputs this:
#  [1]  593 1847 1504  945  352  691 1056 1395   58  505  706  885 2015  337  360  925   49 1065 1894  637
# [21]   60  269 1075    0  113   93
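To make the conversion step concrete, here is a small offline sketch of how the leading number can be pulled out of heading text. The heading strings are hypothetical placeholders; the actual wording of the headings on the site may differ:

```r
# Hypothetical heading strings (not copied from the site)
headings <- c("593 Players", "1847 Players", "0 Players")

# Keep only the leading run of digits, then convert to integer
player_quantity <- as.integer(sub("^([0-9]+).*$", "\\1", headings))

player_quantity
```

A heading that does not start with a digit would be left unchanged by sub() and would then become NA (with a warning) in as.integer(), which is worth checking for before relying on the counts.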

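What I am struggling with is using these counts to loop through each page, copy each player's URL, and then run my code to extract the player attributes (which I have already written and which works, just not for this problem).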

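The XPath for a player looks like this: //*[@id="div_players_"]/p[1]/a. Below is the code I have written/copied so far from "Reading table from https webpage using readHTMLTable", but when it runs it does not seem to return anything, and I am not sure why.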

mainweb="https://www.baseball-reference.com/players/"

urls = read_html("https://www.baseball-reference.com/players/a/") %>%
html_nodes("#active a") %>%
html_attrs()

teamdata=c()
j=1
for(i in urls){
  bball <- html(paste(mainweb, i, sep=""))
  teamdata[j]= bball %>%
  html_nodes(paste0("#", gsub("/teams/([A-Z]+)/$","\\1", urls[j], perl=TRUE))) %>%
  html_table()
  j=j+1
}

Any help or ideas would be greatly appreciated!

1 Answer:

Answer 0: (score: 1)

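The following will get you all the names and their associated links. From there, you should be able to loop or map over the links and apply processing and/or html_table extraction: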

library(tidyverse)
library(rvest)

base_url <- "https://www.baseball-reference.com"

# Only doing this for the first four letters, just change to letters[1:26]
links_by_letter <- paste0(base_url, "/players/", letters[1:4])

# Create a function that returns the links for a given letter
get_links_for_letter <- function(url) {
  # Using httr::RETRY in case we are burdening the server
  link_elements <- read_html(httr::RETRY("GET", url)) %>%
    html_nodes("#div_players_ a")

  links <- link_elements %>%
    html_attr("href") %>%
    paste0(base_url, .) %>%
    set_names(., nm = link_elements %>% html_text)

  return(links)
}

# Make a 'safe' version that returns NA in case we do not get anything back.
safe_get_links_for_letter <- possibly(~ get_links_for_letter(.x), otherwise = NA)

results <- 
  links_by_letter %>%
  map(~ safe_get_links_for_letter(.)) %>%
  map_df(enframe)

head(results)
# # A tibble: 6 x 2
#   name          value                                                       
#   <chr>         <chr>                                                       
# 1 David Aardsma https://www.baseball-reference.com/players/a/aardsda01.shtml
# 2 Hank Aaron    https://www.baseball-reference.com/players/a/aaronha01.shtml
# 3 Tommie Aaron  https://www.baseball-reference.com/players/a/aaronto01.shtml
# 4 Don Aase      https://www.baseball-reference.com/players/a/aasedo01.shtml 
# 5 Andy Abad     https://www.baseball-reference.com/players/a/abadan01.shtml 
# 6 Fernando Abad https://www.baseball-reference.com/players/a/abadfe01.shtml 

tail(results)

# # A tibble: 6 x 2
#   name             value                                                       
#   <chr>            <chr>                                                       
# 1 Radhames Dykhoff https://www.baseball-reference.com/players/d/dykhora01.shtml
# 2 Allan Dykstra    https://www.baseball-reference.com/players/d/dykstal01.shtml
# 3 Lenny Dykstra    https://www.baseball-reference.com/players/d/dykstle01.shtml
# 4 John Dyler       https://www.baseball-reference.com/players/d/dylerjo01.shtml
# 5 Jarrod Dyson     https://www.baseball-reference.com/players/d/dysonja01.shtml
# 6 Sam Dyson        https://www.baseball-reference.com/players/d/dysonsa01.shtml
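
The remaining step, mapping over the links and applying html_table, could look roughly like the sketch below. It parses inline HTML fragments rather than requesting live pages, so the markup, the values in it, the names "Player A" and "Player B", and the helper extract_first_table are all hypothetical stand-ins; real player pages may be structured differently. For live pages, each URL in results$value would be passed to read_html(), ideally with a Sys.sleep() pause between requests.

```r
library(rvest)
library(purrr)

# Hypothetical stand-ins for downloaded player pages (not real site markup or data)
pages <- list(
  "Player A" = "<html><body><table><tr><th>Year</th><th>HR</th></tr><tr><td>2000</td><td>10</td></tr></table></body></html>",
  "Player B" = "<html><body><p>No table here</p></body></html>"
)

# Hypothetical helper: parse one page and return its first table as a data frame
extract_first_table <- function(html_source) {
  tables <- read_html(html_source) %>% html_nodes("table")
  if (length(tables) == 0) stop("no table found")
  html_table(tables[[1]])
}

# Return NULL instead of failing, so one bad page does not stop the whole run
safe_extract <- possibly(extract_first_table, otherwise = NULL)

# Apply the safe extractor to every page, keeping the player names
player_tables <- map(pages, safe_extract)

# Drop the pages that produced nothing
player_tables <- compact(player_tables)
```

Wrapping the extractor in possibly() mirrors the approach used for the letter pages above: a page without the expected table is skipped rather than aborting the loop.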