有下拉框时抓取

时间:2020-10-27 04:58:45

标签: r rvest

我正在尝试从以下网站抓取数据:https://electproject.github.io/Early-Vote-2020G/SC.html

我可以使用以下代码彻底清除第一个:

require(xml2)
require(rvest)
require(janitor)
require(tidyverse)

link <- "https://electproject.github.io/Early-Vote-2020G/SC.html"
raw_webpage <- read_html(link)
html_table(raw_webpage, fill = TRUE)[[1]]

但是,页面上包含县级信息的第二张表却让我感到困惑。有人知道如何抓取r

  1. Show Entries下拉框中选择100
  2. 删除所有县级详细信息

谢谢。

1 个答案:

答案 0 :(得分:3)

在网页上,此数据是使用DataTables(这是一个JavaScript插件)显示的,不能直接使用rvest进行交互。幸运的是,数据随页面一起加载,并且可以直接从htmlwidget中提取:

library(rvest)
library(jsonlite)
library(purrr)
library(dplyr)

link <- "https://electproject.github.io/Early-Vote-2020G/SC.html"
raw_webpage <- read_html(link)

json_dat <- raw_webpage %>%
  html_node(xpath = "//script[@data-for='htmlwidget-0ecc3fac592e3c6771ab']") %>%
  html_text() %>%
  fromJSON()

json_dat %>% 
  purrr::pluck("x", "data") %>%
  t() %>%
  as_tibble(.name_repair = "minimal") %>%
  set_names(json_dat %>% 
              purrr::pluck("x", "container") %>%
              read_html() %>%
              html_table() %>%
              map(names) %>%
              unlist()) %>%
  mutate(across(-1, as.numeric))

# A tibble: 46 x 4
   County        `Mail Ballots Requested` `Mail Ballot Returned` `Percent Returned`
   <chr>                            <dbl>                  <dbl>              <dbl>
 1 01-ABBEVILLE                      4617                   4392              0.951
 2 02-AIKEN                         31520                  28028              0.889
 3 03-ALLENDALE                      1836                   1701              0.926
 4 04-ANDERSON                      26596                  22574              0.849
 5 05-BAMBERG                        2609                   2417              0.926
 6 06-BARNWELL                       4007                   3661              0.914
 7 07-BEAUFORT                      47448                  41409              0.873
 8 08-BERKELEY                      42517                  37657              0.886
 9 09-CALHOUN                        2985                   2704              0.906
10 10-CHARLESTON                   122073                 105124              0.861
# ... with 36 more rows

要将其扩展为刮擦可用状态的功能,可以生成链接列表并对其进行迭代:

links <- set_names(paste0("https://electproject.github.io/Early-Vote-2020G/", state.abb, ".html"), state.name)

states_dat <- imap(links[8:12], ~ {
  tryCatch({
    
    raw_webpage <- read_html(.x)
    
    # Get html widget ID
    hw_id <- raw_webpage %>%
      html_node("div.datatables") %>%
      html_attr("id")
    
    json_dat <- raw_webpage %>%
      html_node(xpath = paste0("//script[@data-for='", hw_id, "']")) %>%
      html_text() %>%
      fromJSON()
    
    json_dat %>%
      purrr::pluck("x", "data") %>%
      t() %>%
      as_tibble(.name_repair = "minimal") %>%
      set_names(
        json_dat %>%
          purrr::pluck("x", "container") %>%
          read_html() %>%
          html_table() %>%
          map(names) %>%
          unlist()
      ) %>%
      mutate(across(-1, as.numeric))
  },
  error = function(e)
    paste(.y, "data not available:", e))
})