Question

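I built a simple scraper to get a data frame with the 2020 NFL draft results. I intend to map this code over multiple years of results, but for some reason, when I change the single-page scrape to any year before 2020, I get the error shown at the bottom.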

library(tidyverse)
library(rvest)
library(httr)
library(curl)

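This works perfectly for the 2020 scrape, although the column names end up in row 1. That is no big deal for me since I can handle it later (I mention it only in case it is related to the problem):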

x <- "https://www.pro-football-reference.com/years/2020/draft.htm"
df <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>% 
        html_nodes("table") %>% 
        html_table() %>%
        as.data.frame()

Below, the URL is changed from 2020 to 2019, which is a live page with a table in the same format. For some reason, the same call as above does not work:

x <- "https://www.pro-football-reference.com/years/2019/draft.htm"
df <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>% 
        html_nodes("table") %>% 
        html_table() %>%
        as.data.frame()

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
arguments imply differing number of rows: 261, 2

Answer 1

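There are two tables at the URL provided: the core draft (table 1, id = "drafts") and the supplemental draft (table 2, id = "drafts_supp").

The as.data.frame() call fails because it tries to combine the two tables into one data frame, but they differ in their column names and in their dimensions (the error reports 261 rows against 2). You can tell rvest to read only the specific table you are interested in by passing html_node() either an xpath or a selector. You can find the xpath or selector by inspecting that table: right click > Inspect in Chrome / Mozilla. Note that for the selector to match on an ID you need to use #drafts, not just drafts, and for the xpath you typically have to wrap the text in single quotes.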

This works: html_node(xpath = '//*[@id="drafts"]')
This does not, because of the nested double quotes: html_node(xpath = "//*[@id="drafts"]")
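As a small illustration of the quoting point, the sketch below writes the same XPath three ways that R will parse. The variable names are made up for this example. The failing form above breaks because the inner double quote ends the R string early, so R reports a parse error before rvest is ever called.

```r
# Three ways to write the XPath so that R parses it as one string.
single_outer <- '//*[@id="drafts"]'    # single quotes outside, double inside
escaped      <- "//*[@id=\"drafts\"]"  # double quotes outside, inner ones escaped
double_outer <- "//*[@id='drafts']"    # double quotes outside, single inside

# The first two are exactly the same string once parsed.
identical(single_outer, escaped)

# Print all three to compare them side by side.
cat(single_outer, escaped, double_outer, sep = "\n")
```

Any of the three can be passed as the xpath argument; which one to use is mostly a matter of taste.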

Note that I believe the html_nodes("table") in your example is unnecessary, since html_table() already selects only the tables.
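To see the failure without hitting the site, here is a minimal offline sketch. The HTML is a made-up stand-in for the real page (only the two ids are taken from it), and the row counts are chosen so that they cannot be recycled against each other, as with 261 and 2 in the real error.

```r
library(rvest)

# Made-up stand-in for the real page: two tables with different shapes.
page <- read_html('
  <html><body>
  <table id="drafts">
    <tr><th>Rnd</th><th>Pick</th><th>Player</th></tr>
    <tr><td>1</td><td>1</td><td>Player A</td></tr>
    <tr><td>1</td><td>2</td><td>Player B</td></tr>
    <tr><td>1</td><td>3</td><td>Player C</td></tr>
  </table>
  <table id="drafts_supp">
    <tr><th>Rnd</th><th>Player</th></tr>
    <tr><td>5</td><td>Player D</td></tr>
    <tr><td>6</td><td>Player E</td></tr>
  </table>
  </body></html>
')

# html_table() on the whole document returns one list element per <table>,
# so no html_nodes("table") step is needed.
tables <- html_table(page)
length(tables)
sapply(tables, nrow)

# as.data.frame() on the list tries to bind the tables side by side,
# which errors when the row counts (3 and 2 here) are incompatible.
combined <- try(as.data.frame(tables), silent = TRUE)
inherits(combined, "try-error")

# Selecting by id returns a single table instead of a list.
drafts <- page %>% html_node("#drafts") %>% html_table()
nrow(drafts)
```

This would also explain why the 2020 page worked: if it held only one table at the time, there was nothing to combine.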

library(rvest)

x <- "https://www.pro-football-reference.com/years/2019/draft.htm"

raw_html <- read_html(x)

# use xpath
raw_html %>% 
  html_node(xpath = '//*[@id="drafts"]') %>%
  html_table()

# use selector
raw_html %>% 
  html_node("#drafts") %>% 
  html_table()
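Since the original goal was to map the scrape over several years, the selector approach could be wrapped along these lines. This is only a sketch: extract_drafts() and scrape_draft() are hypothetical helper names, the two-second pause is an assumed courtesy delay, and it assumes every year's page has a table with id "drafts" (confirmed here only for 2019). It also leaves the header-in-row-1 cleanup from the question untouched.

```r
library(rvest)
library(purrr)

# Hypothetical helper: pull only the main draft table out of a parsed page.
extract_drafts <- function(page) {
  page %>%
    html_node("#drafts") %>%
    html_table()
}

# Hypothetical helper: build the URL for one year, fetch it, return the table.
scrape_draft <- function(year) {
  url <- paste0("https://www.pro-football-reference.com/years/", year, "/draft.htm")
  Sys.sleep(2)  # assumed polite delay between requests; adjust as needed
  extract_drafts(read_html(url))
}

# Not run here because it hits the network. The result would be a list
# with one table per year, named "2018", "2019", "2020":
# drafts_by_year <- map(set_names(2018:2020), scrape_draft)
```

Keeping the result as a named list, rather than binding everything into one data frame right away, avoids a repeat of the original error if a year's table turns out to have a different shape.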

Same webscrape code works on one page but not another using rvest
