Using R to scrape a pop-up table on a web page with no submit button

Date: 2019-05-25 17:03:07

Tags: css r web-scraping

I am trying to retrieve zip codes from https://www.zipcodestogo.com/county-zip-code-list.htm, where the state and county will be supplied from a dataset. Take Dale County, Alabama as an example (shown below). However, when I use Selector Gadget to extract the table, it does not show up, and I cannot find the table in the page source either. I am not sure how to work around this. I am new to web scraping, so apologies in advance if this is a silly question. Thanks.

library(httr)
library(rvest)

zipurl <- 'https://www.zipcodestogo.com/county-zip-code-list.htm'
query <- list('State:' = "Alabama",
              'Counties:' = "Dale")
website <- POST(zipurl, body = query, encode = "form")
tables <- html_nodes(content(website), css = 'table')  # the zip code table is not in the result

2 Answers:

Answer 0 (score: 1):

You can use the link that your browser shows under Inspect > Network tab.

Here is a solution:

state = "ALABAMA"
county = "DALE"
url_scrape = paste0("https://www.zipcodestogo.com/lookups/countyZipCodes.php?state=",state,"&county=",county) # Inspect > Network > XHR links

# function => First letter Capital (needed for regexp)
capwords <- function(s, strict = T) { # You can find this function on the forum
  cap <- function(s) paste(toupper(substring(s, 1, 1)),
                           {s <- substring(s, 2); if(strict) tolower(s) else s},
                           sep = "", collapse = " " )
  sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}

zip_codes = read_html(url_scrape) %>% html_nodes("td") %>% html_text()
zip_codes = zip_codes[-c(1:6)] # Delete header
string_regexp = paste0(capwords(state),"|View") # pattern as var
zip_codes = zip_codes[-grep(pattern = string_regexp,zip_codes)]
df = data.frame(zip = zip_codes[grep("\\d",zip_codes)], label = zip_codes[-grep("\\d",zip_codes)])
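
Since the question says the state and county will come from a dataset, the steps above can be wrapped into a parameterised helper. This is only a sketch of the same logic (get_zips is a made-up name, and URLencode is added on the assumption that some county names contain spaces); it reuses the capwords() function defined above:

get_zips <- function(state, county) {
  state  <- toupper(state)
  county <- toupper(county)
  url <- paste0("https://www.zipcodestogo.com/lookups/countyZipCodes.php?state=",
                URLencode(state), "&county=", URLencode(county))
  cells <- read_html(url) %>% html_nodes("td") %>% html_text()
  cells <- cells[-c(1:6)]                                        # drop the header cells
  cells <- cells[-grep(paste0(capwords(state), "|View"), cells)] # drop state-name and "View" cells
  data.frame(zip   = cells[grep("\\d", cells)],
             label = cells[-grep("\\d", cells)])
}

# get_zips("Alabama", "Dale")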

Answer 1 (score: 1):

Same idea, but grabbing the table and removing the header:

library(rvest)
library(dplyr)  # for slice()

state = "ALABAMA"
county = "DALE"
url = paste0("https://www.zipcodestogo.com/lookups/countyZipCodes.php?state=", state, "&county=", county)

r <- read_html(url) %>%
  html_node("table table") %>%  # the nested table that holds the zip codes
  html_table() %>%
  slice(-1)                     # drop the header row

print(r)

Then, for just the zip code column:

r$X1

You could also restrict to the first column of the table and remove the first row:

r <- read_html(url) %>%
  html_nodes("table table td:nth-of-type(1)") %>% 
  html_text() %>% 
  as.character

print(r[-1])
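
If the state/county pairs live in a data frame, this version is easy to apply row by row. A minimal sketch (the lookups data frame and the Map() loop are illustrative, not part of the answer):

library(rvest)

# Illustrative input; in practice this would come from the question's dataset
lookups <- data.frame(state  = c("ALABAMA", "ALABAMA"),
                      county = c("DALE", "COFFEE"),
                      stringsAsFactors = FALSE)

get_county_zips <- function(state, county) {
  url <- paste0("https://www.zipcodestogo.com/lookups/countyZipCodes.php?state=",
                state, "&county=", county)
  cells <- read_html(url) %>%
    html_nodes("table table td:nth-of-type(1)") %>%
    html_text()
  cells[-1]  # drop the header cell, as with r[-1] above
}

zips_by_county <- Map(get_county_zips, lookups$state, lookups$county)
names(zips_by_county) <- lookups$county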