Question

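I am learning how to scrape information from websites using httr and XML in R. It works fine for websites with just a few tables, but I can't figure it out for websites with many tables. Using the following page from pro-football-reference as an example: https://www.pro-football-reference.com/boxscores/201609110atl.htm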

library(httr)
library(XML)

# To get just the boxscore by quarter, which is the first table:
URL = "https://www.pro-football-reference.com/boxscores/201609080den.htm"
URL = GET(URL)
SnapTable = readHTMLTable(rawToChar(URL$content), stringsAsFactors=F)[[1]]

# Return the number of tables:
AllTables = readHTMLTable(rawToChar(URL$content), stringsAsFactors=F)
length(AllTables)
[1] 2

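So I am able to scrape information, but for some reason I can only capture the first two of the 20+ tables on the page. For practice, I am trying to get the "Starters" tables and the "Officials" table.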

Is my inability to get the other tables a matter of the website's setup or of incorrect code?

Answer 1

When it comes to web scraping in R, make heavy use of the rvest package.

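Fetching the HTML the way you do is fine. rvest, however, works with CSS selectors, and SelectorGadget helps you find a pattern in the styling of a particular table that is hopefully unique. That way you can extract exactly the tables you are looking for rather than getting them by coincidence.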
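As a rough illustration of the CSS-selector idea, here is a minimal sketch on a small inline HTML fragment. The fragment, the table ids (`linescore`, `officials`) and all cell values are made up for the example and are not taken from the real page; on the actual site you would look up the selector with SelectorGadget or the browser inspector.

```r
library(rvest)

# Hypothetical page fragment with two tables; ids and values are invented for illustration.
page <- read_html('
<div id="content">
  <table id="linescore">
    <tr><th>Team</th><th>Q1</th><th>Q2</th></tr>
    <tr><td>Team A</td><td>1</td><td>2</td></tr>
    <tr><td>Team B</td><td>3</td><td>4</td></tr>
  </table>
  <table id="officials">
    <tr><th>Role</th><th>Name</th></tr>
    <tr><td>Referee</td><td>Example Person</td></tr>
  </table>
</div>')

# Pick one table by its id with a CSS selector, then parse it into a data frame.
officials <- page %>%
  html_node(css = "table#officials") %>%
  html_table()

print(officials)
```

Selecting by id (or another stable attribute) tends to be less fragile than relying on the position of a table in the list returned by readHTMLTable.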

To get you started, read the vignette on rvest for more detailed information.

#install.packages("rvest")
library(rvest)
library(magrittr)

# Store web url
fb_url = "https://www.pro-football-reference.com/boxscores/201609080den.htm"

linescore = fb_url %>%
    read_html() %>%
    html_node(xpath = '//*[@id="content"]/div[3]/table') %>%
    html_table()
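One possible reason why only a couple of tables show up, which is an assumption on my part and not something confirmed in this thread, is that some sites ship the remaining tables inside HTML comments and only reveal them with JavaScript. If that is the case here, a sketch like the following might help: it reads the comment nodes, re-parses their text as HTML, and looks for tables there. The fragment and the ids `visible` and `hidden` are hypothetical.

```r
library(rvest)

# Hypothetical fragment: one regular table plus one table wrapped in an HTML comment.
html <- '
<div id="content">
  <table id="visible"><tr><th>A</th></tr><tr><td>1</td></tr></table>
  <!--
  <table id="hidden"><tr><th>B</th></tr><tr><td>2</td></tr></table>
  -->
</div>'

page <- read_html(html)

# Tables the parser sees directly.
visible_tables <- page %>%
  html_nodes("table") %>%
  html_table()

# Take the text of every comment node, re-parse it as HTML, and extract tables from it.
commented_tables <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table()

length(visible_tables)
length(commented_tables)
```

On a real page you would replace `html` with the URL and then pick the tables you want (for example "Starters" or "Officials") from `commented_tables` by name or selector.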

Hope this helps.

R: Scraping multiple tables in URL
