Scraping a table from .asp

Date: 2018-09-03 16:33:24

Tags: r web-scraping

I am trying to scrape the main table of this site. I don't know much about web development, but I have scraped other sites after working through a few tutorials.

Since I haven't found anything specific about .asp pages, I tried to follow a few tutorials, such as this one.

However, when I run the following code, it returns an empty list. Why is it returning an empty list, and how can I get the table data?

library(rvest)

url <- "http://www2.aneel.gov.br/scg/gd/VerGD.asp?pagina=1&acao=buscar&login=&NomPessoa=&IdAgente=&DatConexaoInicio=&DatConexaoFim="

table <- url %>%
    read_html() %>%
    html_nodes(xpath = "/html/body/table/tbody/tr[4]/td/table[4]") %>%
    html_table()

1 Answer:

Answer 0 (score: 1)

The HTML on that site is so severely malformed that it trips up libxml2 (the C library behind the rvest and xml2 R packages) when parsing.

I ran

htmltidy::tidy_html()

on just the un-parsed HTML content from the site, and it spat back:

## Tidy found 1551 warnings and 8 errors! Not all warnings/errors were shown.

There were quite a few lines pointing out that many table rows had to be discarded.
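For reference, here is roughly how that diagnostic run looks. This is only a sketch (the default htmltidy options are assumed to be enough); the only point is that tidy_html() gets fed the raw, un-parsed response text:

library(httr)
library(htmltidy)

# fetch the page and keep the response body as plain text, without parsing it
res <- GET("http://www2.aneel.gov.br/scg/gd/VerGD.asp?pagina=1&acao=buscar")
raw_html <- content(res, as = "text")

# tidy_html() warns with a summary of how many problems libtidy found
tidied <- tidy_html(raw_html)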

One big problem with the content is this:

(screenshot of the raw page source: a <form> element pasted into an illegal position)

Most of the other table rows are also missing their opening <tr>. Whoever coded that site should be banned from ever making web content again.

So we have to edit the HTML before parsing it. And while we're at it, we can also support the parameters of that main search form:

get_power_info <- function(pagina = 1L, acao = "buscar", login = "",
                           nom_pessoa = "", id_agente = "", dat_conexao_inicio = "",
                           dat_conexao_fim = "") {

  # we need a lot of packages to make this work

  suppressPackageStartupMessages({
    require("httr", quietly = TRUE, warn.conflicts = FALSE)
    require("xml2", quietly = TRUE, warn.conflicts = FALSE)
    require("rvest", quietly = TRUE, warn.conflicts = FALSE)
    require("janitor", quietly = TRUE, warn.conflicts = FALSE)
    require("stringi", quietly = TRUE, warn.conflicts = FALSE)
    require("dplyr", quietly = TRUE, warn.conflicts = FALSE)
  })

  # get the page like a browser

  httr::GET(
    url = "http://www2.aneel.gov.br/scg/gd/VerGD.asp",
    query = list(
      pagina = as.character(as.integer(pagina)),
      acao = acao,
      login = login,
      NomPessoa = nom_pessoa, 
      IdAgente = id_agente,
      DatConexaoInicio = dat_conexao_inicio,
      DatConexaoFim = dat_conexao_fim
    )
  ) -> res

  httr::stop_for_status(res)

  # DON'T PARSE IT YET

  out <- httr::content(res, as = "text")

  # Remove beginning & trailing whitespace from lines

  l <- stri_trim_both(stri_split_lines(out)[[1]])

  # Now, remove all form-component lines and all blank lines

  l[-c(
    which(grepl("<form", l, fixed = TRUE)), 
    which(grepl("<input", l, fixed = TRUE)),
    which(l == "")
  )] -> l

  # Get the indices of all the <td> tags that should have a <tr> before them but don't

  to_fix <- c()
  for (i in 1:(length(l)-1)) {
    if (all(c(
      grepl("/tr", l[i]), grepl("td", l[i+1])
    ))) {
      to_fix <- c(to_fix, (i+1))
    }

  }

  # Fix them

  l[to_fix] <- sprintf("<tr>%s", l[to_fix])

  # NOW WE CAN PARSE IT

  x <- read_html(paste0(l, collapse="\n"))

  # Find the table in a less breakable way

  tabl <- html_nodes(x, xpath=".//table[@class = 'tabelaMaior']/tr/td[contains(., 'UNIDADES')]/../..")

  # Remove the useless title row that makes html_table() cry

  xml_remove(html_node(tabl, xpath=".//tr[1]"))

  # Remove the bottom pagination row that makes html_table() cry

  xml_remove(html_node(tabl, xpath=".//tr/td[@colspan = '20']/.."))

  # Extract the table with better column names

  xdat <- html_table(tabl, header=TRUE, trim=TRUE)[[1]] 
  xdat <- janitor::clean_names(xdat)
  xdat <- dplyr::tbl_df(xdat)

  xdat

}

See it in action:

xdf <- get_power_info()
## # A tibble: 1,000 x 14
##    distribuidora    codigo_da_gd  titular_da_uc   classe subgrupo modalidade
##    <chr>            <chr>         <chr>           <chr>  <chr>    <chr>     
##  1 Companhia Energ… GD.PI.000.04… MARCELO FORTES… Resid… B1       Geracao n…
##  2 Companhia Energ… GD.PI.000.04… JOSE DE JESUS … Resid… B1       Geracao n…
##  3 Companhia de El… GD.BA.000.04… MAURICIO FRAGO… Resid… B1       Geracao n…
##  4 Companhia Energ… GD.PI.000.04… VALTER CID MEN… Resid… B1       Geracao n…
##  5 COPEL DISTRIBUI… GD.PR.000.04… WALDEMAR PERES… Comer… B3       Geracao n…
##  6 CELG DISTRIBUIÇ… GD.GO.000.04… Alfredo Ambrós… Resid… B1       Geracao n…
##  7 CELG DISTRIBUIÇ… GD.GO.000.04… Reginaldo Rosa… Resid… B1       Geracao n…
##  8 Companhia Energ… GD.PI.000.04… LIVIO JEFFERSO… Resid… B1       Geracao n…
##  9 Companhia Energ… GD.PI.000.04… Francislene Me… Resid… B1       Geracao n…
## 10 Companhia Energ… GD.PI.000.04… LUISA MARIA MO… Resid… B1       Geracao n…
## # ... with 990 more rows, and 8 more variables:
## #   quantidade_de_u_cs_que_recebem_os_creditos <int>, municipio <chr>,
## #   uf <chr>, cep <chr>, data_conexao <chr>, tipo <chr>, fonte <chr>,
## #   potencia_instalada_k_w <chr>

NOTE: You'll likely want to actually scrape that pagination row first (before removing it) to grab all the pagination links, so you can repeat the scrape across pages. I have no idea whether the query parameters work as intended, but that's something you should work on as well ;-)
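As a rough, untested sketch of that idea (how many pages exist, and whether the pagina parameter really pages through the results, are assumptions), you could loop over get_power_info() and stack the results:

library(purrr)

# hypothetical helper: walk the `pagina` parameter and bind the pages together,
# silently skipping any page that errors out
scrape_pages <- function(n_pages = 5L) {
  purrr::map_dfr(seq_len(n_pages), function(pg) {
    tryCatch(get_power_info(pagina = pg), error = function(e) NULL)
  })
}

# all_pages <- scrape_pages(3)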