Using R to scrape a pop-up table on a web page with no submit button

Date: 2019-05-25 17:03:07

Tags: css r web-scraping

I am trying to retrieve zip codes from https://www.zipcodestogo.com/county-zip-code-list.htm, where the state and county will be supplied from a dataset. Take Dale County, Alabama as an example (shown below). However, when I use Selector Gadget to extract the table, it does not show up, and I cannot find the table in the page source either. I am not sure how to work around this. I am new to web scraping, so apologies in advance if this is a silly question. Thanks.

library(httr)
library(rvest)

zipurl <- 'https://www.zipcodestogo.com/county-zip-code-list.htm'
query <- list('State:' = "Alabama",
              'Counties:' = "Dale")
website <- POST(zipurl, body = query, encode = "form")
tables <- html_nodes(content(website), css = 'table')  # the zip code table is not in the result

2 Answers:

Answer 0 (score: 1):

You can use the link that your browser shows under Inspect > Network tab.

Here is a solution:

state = "ALABAMA"
county = "DALE"
url_scrape = paste0("https://www.zipcodestogo.com/lookups/countyZipCodes.php?state=",state,"&county=",county) # Inspect > Network > XHR links

# function => First letter Capital (needed for regexp)
capwords <- function(s, strict = T) { # You can find this function on the forum
  cap <- function(s) paste(toupper(substring(s, 1, 1)),
                           {s <- substring(s, 2); if(strict) tolower(s) else s},
                           sep = "", collapse = " " )
  sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}

zip_codes = read_html(url_scrape) %>% html_nodes("td") %>% html_text()
zip_codes = zip_codes[-c(1:6)] # Delete header
string_regexp = paste0(capwords(state),"|View") # pattern as var
zip_codes = zip_codes[-grep(pattern = string_regexp,zip_codes)]
df = data.frame(zip = zip_codes[grep("\\d",zip_codes)], label = zip_codes[-grep("\\d",zip_codes)])
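
Since the question says the state and county will come from a dataset, the steps above can be wrapped into a parameterised helper. This is only a sketch of the same logic (get_zips is a made-up name, and URLencode is added on the assumption that some county names contain spaces); it reuses the capwords() function defined above:

get_zips <- function(state, county) {
  state  <- toupper(state)
  county <- toupper(county)
  url <- paste0("https://www.zipcodestogo.com/lookups/countyZipCodes.php?state=",
                URLencode(state), "&county=", URLencode(county))
  cells <- read_html(url) %>% html_nodes("td") %>% html_text()
  cells <- cells[-c(1:6)]                                        # drop the header cells
  cells <- cells[-grep(paste0(capwords(state), "|View"), cells)] # drop state-name and "View" cells
  data.frame(zip   = cells[grep("\\d", cells)],
             label = cells[-grep("\\d", cells)])
}

# get_zips("Alabama", "Dale")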

Answer 1 (score: 1):

Same idea, but grabbing the table and removing the header:

library(rvest)
library(dplyr)  # for slice()

state = "ALABAMA"
county = "DALE"
url = paste0("https://www.zipcodestogo.com/lookups/countyZipCodes.php?state=", state, "&county=", county)

r <- read_html(url) %>%
  html_node("table table") %>%  # the nested table that holds the zip codes
  html_table() %>%
  slice(-1)                     # drop the header row

print(r)

Then, for just the zip code column:

r$X1

You could also restrict to the first column of the table and remove the first row:

r <- read_html(url) %>%
  html_nodes("table table td:nth-of-type(1)") %>% 
  html_text() %>% 
  as.character

print(r[-1])
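
If the state/county pairs live in a data frame, this version is easy to apply row by row. A minimal sketch (the lookups data frame and the Map() loop are illustrative, not part of the answer):

library(rvest)

# Illustrative input; in practice this would come from the question's dataset
lookups <- data.frame(state  = c("ALABAMA", "ALABAMA"),
                      county = c("DALE", "COFFEE"),
                      stringsAsFactors = FALSE)

get_county_zips <- function(state, county) {
  url <- paste0("https://www.zipcodestogo.com/lookups/countyZipCodes.php?state=",
                state, "&county=", county)
  cells <- read_html(url) %>%
    html_nodes("table table td:nth-of-type(1)") %>%
    html_text()
  cells[-1]  # drop the header cell, as with r[-1] above
}

zips_by_county <- Map(get_county_zips, lookups$state, lookups$county)
names(zips_by_county) <- lookups$county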