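I am working on a school project that involves using R to scrape player attributes from https://www.baseball-reference.com and build a data frame from them. The site lists all players alphabetically, and I wrote the following code to create a URL for each letter: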
# every baseball player is identified by their last name; using all the
# letters allows me to build urls with the letters
ltrs <- letters
# create an empty container for the urls
url_container <- c()
# this is the base url I append letters to
url = "https://www.baseball-reference.com/players/"
# use a for loop to create the urls
for (i in seq_along(ltrs)) {
  url_start = paste(url, ltrs[i], "/", sep = '')
  url_container = c(url_container, url_start)
}
# print the container to make sure the urls are correctly constructed
url_container
# This outputs:
# [1] "https://www.baseball-reference.com/players/a/"
# [2] "https://www.baseball-reference.com/players/b/" etc.
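Because `paste0()` is vectorized over its arguments, the same 26 URLs can also be built without a loop. A minimal sketch (the names `base_players_url` and `url_container_vec` are only illustrative):

```r
# paste0() recycles the single base URL across every element of `letters`
base_players_url <- "https://www.baseball-reference.com/players/"
url_container_vec <- paste0(base_players_url, letters, "/")

# one URL per letter, in alphabetical order
length(url_container_vec)
head(url_container_vec, 2)
```

This avoids growing a vector inside a loop and should give the same result as `url_container` above.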
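Each page lists a certain number of players, and I can scrape that count with the code below, which outputs the number of players for each letter.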
library(rvest)
player_quantity <- c()
for (i in seq_along(url_container)) {
  raw = read_html(url_container[i])
  player_count <- raw %>%
    # this is where the player count lives
    html_nodes(xpath = "//*[@id='all_players_']/div[1]/h2") %>%
    # grab the heading text; it is converted to a number after the loop
    # (it will define how many tags we go through)
    html_text()
  player_quantity <- c(player_quantity, player_count)
}
player_quantity <- as.numeric(gsub("([0-9]+).*$", "\\1", player_quantity))
player_quantity
# Outputs this:
#  [1]  593 1847 1504  945  352  691 1056 1395   58  505  706  885 2015  337  360  925   49 1065 1894  637
# [21]   60  269 1075    0  113   93
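What I am struggling with is using these counts to loop through each page, copy each player's URL, and then run my code that extracts the player attributes (which I have already written and which works, but is not relevant to this question).
The XPath for a player looks like this: //*[@id="div_players_"]/p[1]/a. This is the code I have written/copied so far from "Reading table from https webpage using readHTMLTable", but it doesn't seem to return anything when it runs, and I am not sure why.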
mainweb="https://www.baseball-reference.com/players/"
urls = read_html("https://www.baseball-reference.com/players/a/") %>%
  html_nodes("#active a") %>%
  html_attrs()
teamdata=c()
j=1
for(i in urls){
  bball <- html(paste(mainweb, i, sep=""))
  teamdata[j]= bball %>%
    html_nodes(paste0("#", gsub("/teams/([A-Z]+)/$","\\1", urls[j], perl=TRUE))) %>%
    html_table()
  j=j+1
}
Any help or ideas would be greatly appreciated!
Answer 0 (score: 1)
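The following will get you all the names and their associated links. From there, you should be able to loop or map over the links and apply your processing and/or html_table extraction: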
library(tidyverse)
library(rvest)
base_url <- "https://www.baseball-reference.com"
# Only doing this for the first four letters, just change to letters[1:26]
links_by_letter <- paste0(base_url, "/players/", letters[1:4])
# Create a function that returns the links for a given letter
get_links_for_letter <- function(url) {
  # Using httr::RETRY in case we are burdening the server
  link_elements <- read_html(httr::RETRY("GET", url)) %>%
    html_nodes("#div_players_ a")
  links <- link_elements %>%
    html_attr("href") %>%
    paste0(base_url, .) %>%
    set_names(., nm = link_elements %>% html_text)
  return(links)
}
# Make a 'safe' version that returns NA if the request or parsing fails.
safe_get_links_for_letter <- possibly(~ get_links_for_letter(.x), otherwise = NA)
results <-
  links_by_letter %>%
  map(~ safe_get_links_for_letter(.)) %>%
  map_df(enframe)
head(results)
# # A tibble: 6 x 2
# name value
# <chr> <chr>
# 1 David Aardsma https://www.baseball-reference.com/players/a/aardsda01.shtml
# 2 Hank Aaron https://www.baseball-reference.com/players/a/aaronha01.shtml
# 3 Tommie Aaron https://www.baseball-reference.com/players/a/aaronto01.shtml
# 4 Don Aase https://www.baseball-reference.com/players/a/aasedo01.shtml
# 5 Andy Abad https://www.baseball-reference.com/players/a/abadan01.shtml
# 6 Fernando Abad https://www.baseball-reference.com/players/a/abadfe01.shtml
tail(results)
# # A tibble: 6 x 2
# name value
# <chr> <chr>
# 1 Radhames Dykhoff https://www.baseball-reference.com/players/d/dykhora01.shtml
# 2 Allan Dykstra https://www.baseball-reference.com/players/d/dykstal01.shtml
# 3 Lenny Dykstra https://www.baseball-reference.com/players/d/dykstle01.shtml
# 4 John Dyler https://www.baseball-reference.com/players/d/dylerjo01.shtml
# 5 Jarrod Dyson https://www.baseball-reference.com/players/d/dysonja01.shtml
# 6 Sam Dyson https://www.baseball-reference.com/players/d/dysonsa01.shtml
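The "loop or map over the links" step could look roughly like the sketch below. It is an illustration under assumptions, not a tested scraper for the live site: the helper names `extract_player_links` and `get_player_tables`, the `pause` argument, and the hand-written HTML fragment are all hypothetical, and the fragment only mimics the structure implied by the `#div_players_ a` selector used above.

```r
library(rvest)

base_url <- "https://www.baseball-reference.com"

# Pull player links out of an already-parsed index page.
# html_attr("href") returns a character vector of hrefs, whereas
# html_attrs() returns a list holding every attribute of every node.
extract_player_links <- function(page) {
  nodes <- html_nodes(page, "#div_players_ a")
  hrefs <- html_attr(nodes, "href")
  # hrefs are site-absolute ("/players/a/..."), so prepend only the host
  stats::setNames(paste0(base_url, hrefs), html_text(nodes))
}

# Offline check on a made-up fragment (assumed markup, not the real page)
demo_page <- read_html(
  "<div id='div_players_'>
     <p><a href='/players/a/aardsda01.shtml'>David Aardsma</a></p>
     <p><a href='/players/a/aaronha01.shtml'>Hank Aaron</a></p>
   </div>"
)
demo_links <- extract_player_links(demo_page)
demo_links

# Hypothetical per-player step: fetch one page and return all of its tables.
get_player_tables <- function(url, pause = 1) {
  Sys.sleep(pause)  # wait between requests to go easy on the server
  page <- read_html(httr::RETRY("GET", url))
  html_table(page)
}

# Usage (needs network access, so it is left commented out):
# player_tables <- lapply(demo_links, get_player_tables)
```

Two details here may explain why the loop in the question returns nothing. `html_attrs()` yields a list of all attributes rather than a vector of hrefs, and the hrefs already begin with `/players/`, so pasting them onto a base that ends in `/players/` duplicates that path segment. It is also possible that the `#active a` selector matches no nodes on that page, in which case the loop body never runs.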