我正在尝试从以下网站抓取数据:https://electproject.github.io/Early-Vote-2020G/SC.html
我可以使用以下代码彻底清除第一个:
require(xml2)
require(rvest)
require(janitor)
require(tidyverse)
link <- "https://electproject.github.io/Early-Vote-2020G/SC.html"
raw_webpage <- read_html(link)
html_table(raw_webpage, fill = TRUE)[[1]]
但是,页面上包含县级信息的第二张表却让我感到困惑。有人知道如何抓取r
:
Show Entries
下拉框中选择100 谢谢。
答案 0 :(得分:3)
在网页上,此数据是使用DataTables
(这是一个JavaScript插件)显示的,不能直接使用rvest
进行交互。幸运的是,数据随页面一起加载,并且可以直接从htmlwidget中提取:
library(rvest)
library(jsonlite)
library(purrr)
library(dplyr)
link <- "https://electproject.github.io/Early-Vote-2020G/SC.html"
raw_webpage <- read_html(link)
json_dat <- raw_webpage %>%
html_node(xpath = "//script[@data-for='htmlwidget-0ecc3fac592e3c6771ab']") %>%
html_text() %>%
fromJSON()
json_dat %>%
purrr::pluck("x", "data") %>%
t() %>%
as_tibble(.name_repair = "minimal") %>%
set_names(json_dat %>%
purrr::pluck("x", "container") %>%
read_html() %>%
html_table() %>%
map(names) %>%
unlist()) %>%
mutate(across(-1, as.numeric))
# A tibble: 46 x 4
County `Mail Ballots Requested` `Mail Ballot Returned` `Percent Returned`
<chr> <dbl> <dbl> <dbl>
1 01-ABBEVILLE 4617 4392 0.951
2 02-AIKEN 31520 28028 0.889
3 03-ALLENDALE 1836 1701 0.926
4 04-ANDERSON 26596 22574 0.849
5 05-BAMBERG 2609 2417 0.926
6 06-BARNWELL 4007 3661 0.914
7 07-BEAUFORT 47448 41409 0.873
8 08-BERKELEY 42517 37657 0.886
9 09-CALHOUN 2985 2704 0.906
10 10-CHARLESTON 122073 105124 0.861
# ... with 36 more rows
要将其扩展为刮擦可用状态的功能,可以生成链接列表并对其进行迭代:
links <- set_names(paste0("https://electproject.github.io/Early-Vote-2020G/", state.abb, ".html"), state.name)
states_dat <- imap(links[8:12], ~ {
tryCatch({
raw_webpage <- read_html(.x)
# Get html widget ID
hw_id <- raw_webpage %>%
html_node("div.datatables") %>%
html_attr("id")
json_dat <- raw_webpage %>%
html_node(xpath = paste0("//script[@data-for='", hw_id, "']")) %>%
html_text() %>%
fromJSON()
json_dat %>%
purrr::pluck("x", "data") %>%
t() %>%
as_tibble(.name_repair = "minimal") %>%
set_names(
json_dat %>%
purrr::pluck("x", "container") %>%
read_html() %>%
html_table() %>%
map(names) %>%
unlist()
) %>%
mutate(across(-1, as.numeric))
},
error = function(e)
paste(.y, "data not available:", e))
})