I'm trying to scrape the main table of this site. I don't know much about web development, but I've managed to scrape other sites after working through some tutorials.
Since I couldn't find anything specific about .asp pages, I tried to follow some tutorials, such as this one.
However, when I run the following code, it returns an empty list. Why is it returning an empty list, and how can I get the table data?
library(rvest)
url <- "http://www2.aneel.gov.br/scg/gd/VerGD.asp?pagina=1&acao=buscar&login=&NomPessoa=&IdAgente=&DatConexaoInicio=&DatConexaoFim="
table <- url %>%
  read_html() %>%
  html_nodes(xpath = "/html/body/table/tbody/tr[4]/td/table[4]") %>%
  html_table()
Answer 0 (score: 1)
The site's HTML is badly malformed, and it trips up libxml2 (the C library behind the rvest and xml2 R packages) when parsing.
I ran htmltidy::tidy_html() on the raw, unparsed HTML content from the site, and it spat back:
## Tidy found 1551 warnings and 8 errors! Not all warnings/errors were shown.
There are plenty of lines noting that many table rows had to be discarded. One big problem with the content is an illegally placed <form> element, and most of the other table rows are missing their opening <tr> tags. Whoever coded that site should be barred from making any more web content in their lifetime.
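If you want to see those diagnostics yourself, here is a minimal sketch, assuming htmltidy's tidy_html() accepts a verbose flag that prints the warning/error tally (the URL is the one from the question):

library(httr)
library(htmltidy)

res <- GET("http://www2.aneel.gov.br/scg/gd/VerGD.asp?pagina=1&acao=buscar&login=&NomPessoa=&IdAgente=&DatConexaoInicio=&DatConexaoFim=")
raw_html <- content(res, as = "text")

# verbose = TRUE prints tidy's warning/error summary while returning the cleaned HTML
cleaned <- tidy_html(raw_html, verbose = TRUE)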
So we have to edit the HTML before parsing it. And while we're at it, we can also support the parameters of that main form:
get_power_info <- function(pagina = 1L, acao = "buscar", login = "",
                           nom_pessoa = "", id_agente = "",
                           dat_conexao_inicio = "", dat_conexao_fim = "") {

  # we need a lot of packages to make this work
  suppressPackageStartupMessages({
    require("httr", quietly = TRUE, warn.conflicts = FALSE)
    require("xml2", quietly = TRUE, warn.conflicts = FALSE)
    require("rvest", quietly = TRUE, warn.conflicts = FALSE)
    require("janitor", quietly = TRUE, warn.conflicts = FALSE)
    require("stringi", quietly = TRUE, warn.conflicts = FALSE)
    require("dplyr", quietly = TRUE, warn.conflicts = FALSE)
  })

  # get the page like a browser
  httr::GET(
    url = "http://www2.aneel.gov.br/scg/gd/VerGD.asp",
    query = list(
      pagina = as.character(as.integer(pagina)),
      acao = acao,
      login = login,
      NomPessoa = nom_pessoa,
      IdAgente = id_agente,
      DatConexaoInicio = dat_conexao_inicio,
      DatConexaoFim = dat_conexao_fim
    )
  ) -> res

  httr::stop_for_status(res)

  # DON'T PARSE IT YET
  out <- httr::content(res, as = "text")

  # Remove beginning & trailing whitespace from lines
  l <- stri_trim_both(stri_split_lines(out)[[1]])

  # Now, remove all form-component lines and all blank lines
  l[-c(
    which(grepl("<form", l, fixed = TRUE)),
    which(grepl("<input", l, fixed = TRUE)),
    which(l == "")
  )] -> l

  # Get the indices of all the <td> tags that should have a <tr> before them but don't
  to_fix <- c()
  for (i in 1:(length(l) - 1)) {
    if (all(c(grepl("/tr", l[i]), grepl("td", l[i + 1])))) {
      to_fix <- c(to_fix, (i + 1))
    }
  }

  # Fix them
  l[to_fix] <- sprintf("<tr>%s", l[to_fix])

  # NOW WE CAN PARSE IT
  x <- read_html(paste0(l, collapse = "\n"))

  # Find the table in a less breakable way
  tabl <- html_nodes(x, xpath = ".//table[@class = 'tabelaMaior']/tr/td[contains(., 'UNIDADES')]/../..")

  # Remove the useless title row that makes html_table() cry
  xml_remove(html_node(tabl, xpath = ".//tr[1]"))

  # Remove the bottom pagination row that makes html_table() cry
  xml_remove(html_node(tabl, xpath = ".//tr/td[@colspan = '20']/.."))

  # Extract the table with better column names
  xdat <- html_table(tabl, header = TRUE, trim = TRUE)[[1]]
  xdat <- janitor::clean_names(xdat)
  xdat <- dplyr::tbl_df(xdat)

  xdat
}
See it in action:
xdf <- get_power_info()
## # A tibble: 1,000 x 14
## distribuidora codigo_da_gd titular_da_uc classe subgrupo modalidade
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Companhia Energ… GD.PI.000.04… MARCELO FORTES… Resid… B1 Geracao n…
## 2 Companhia Energ… GD.PI.000.04… JOSE DE JESUS … Resid… B1 Geracao n…
## 3 Companhia de El… GD.BA.000.04… MAURICIO FRAGO… Resid… B1 Geracao n…
## 4 Companhia Energ… GD.PI.000.04… VALTER CID MEN… Resid… B1 Geracao n…
## 5 COPEL DISTRIBUI… GD.PR.000.04… WALDEMAR PERES… Comer… B3 Geracao n…
## 6 CELG DISTRIBUIÇ… GD.GO.000.04… Alfredo Ambrós… Resid… B1 Geracao n…
## 7 CELG DISTRIBUIÇ… GD.GO.000.04… Reginaldo Rosa… Resid… B1 Geracao n…
## 8 Companhia Energ… GD.PI.000.04… LIVIO JEFFERSO… Resid… B1 Geracao n…
## 9 Companhia Energ… GD.PI.000.04… Francislene Me… Resid… B1 Geracao n…
## 10 Companhia Energ… GD.PI.000.04… LUISA MARIA MO… Resid… B1 Geracao n…
## # ... with 990 more rows, and 8 more variables:
## # quantidade_de_u_cs_que_recebem_os_creditos <int>, municipio <chr>,
## # uf <chr>, cep <chr>, data_conexao <chr>, tipo <chr>, fonte <chr>,
## # potencia_instalada_k_w <chr>
NOTE: You'll likely want to actually scrape that pagination row first (before removing it) to get all the pagination links, so you can repeat the scrape across pages. I have no idea whether the form parameters work as expected, but you should put some effort into that, too ;-)
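A minimal sketch of that repetition, assuming the pagina parameter pages through the result set the way it appears to; the page count here is a hypothetical placeholder that should really come from the scraped pagination row:

library(purrr)

n_pages <- 5  # hypothetical: read the real count from the pagination row

all_pages <- map_df(seq_len(n_pages), function(pg) {
  Sys.sleep(1)  # be polite to the server between requests
  get_power_info(pagina = pg)
})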