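I'm trying to extract data from the following website: http://livingwage.mit.edu ... at the county level, and I have tried many different iterations of pulling the data with the rvest package. Unfortunately, there are about 5K counties.
I have extracted all of the URLs into one column of a .csv file. The URL format is "http://livingwage.mit.edu/counties/ ..." where "..." is the state code followed by the county code.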
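To illustrate, here is a minimal sketch of loading such a URL column, assuming the column is named `url` (the real header and file path are not given, so both are hypothetical and a temporary file stands in for the real one). The point it shows is that the codes should be kept as text so that leading zeros survive.

```r
# Hypothetical stand-in for the real .csv: one column named "url".
csv_path <- tempfile(fileext = ".csv")
writeLines(c("url",
             "http://livingwage.mit.edu/counties/01001",
             "http://livingwage.mit.edu/counties/01003"),
           csv_path)

# Read the column as character rather than as a factor.
url_table <- read.csv(csv_path, stringsAsFactors = FALSE)
urls <- url_table$url

# Everything after the last slash is the county code, still text.
codes <- sub(".*/", "", urls)
print(codes)
```

Because the codes never pass through a numeric type, "01001" stays "01001".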
The selectors for the data I want (from SelectorGadget) are
css = '.wages_table .even .col-NaN , .wages_table .results .col-NaN'
or the equivalent XPath, which is where I started:
xpath = //*[contains(concat( " ", @class, " " ), concat( " ", "wages_table", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "even", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "col-NaN", " " ))] | //*[contains(concat( " ", @class, " " ), concat( " ", "wages_table", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "results", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "col-NaN", " " ))]
...but I could only extract one table at a time, and I got the header and the last row, which I don't want.
So I tried something like this:
library(rvest)
url <- read_html("http://livingwage.mit.edu/counties/01001")
url %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
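...but I quickly realized that the codes in the URLs are not consecutive (most of them are odd), and the ranges are not contiguous either: a state might have only 4 counties, in which case its URLs only go up to ~/10009, for example.
Finally, when I tried to work through the .csv list of URLs on my desktop, I ended up with this: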
counties <- 01001:54500
urls <- paste0("http://livingwage.mit.edu/counties/", counties)
get_table <- function(url) {
  url %>%
    read_html() %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table()
}
results <- sapply(urls, get_table)
...and I can tell that the CSS selectors and the page-reading step do not play well together.
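Two things may be contributing to the failure of the loop above, sketched here under assumptions. First, R parses `01001` as the number 1001, so the generated URLs lose their leading zero. Second, a single page that cannot be read stops the whole `sapply()` call. The names `safe_fetch` and `fake_fetch` are hypothetical; `fake_fetch` is only a stand-in so the example runs without touching the site.

```r
# R treats 01001 as the number 1001, so the leading zero disappears.
counties <- 01001:01009
naive <- paste0("http://livingwage.mit.edu/counties/", counties)

# sprintf("%05d", ...) pads each code back to five digits.
padded <- paste0("http://livingwage.mit.edu/counties/",
                 sprintf("%05d", counties))

# Hypothetical wrapper: return NULL instead of stopping on an error.
safe_fetch <- function(url, fetch) {
  tryCatch(fetch(url), error = function(e) NULL)
}

# Stand-in fetcher for illustration only: pretends even codes do not exist.
fake_fetch <- function(url) {
  code <- as.integer(sub(".*/", "", url))
  if (code %% 2 == 0) stop("page not found")
  data.frame(code = code)
}

results <- lapply(padded, safe_fetch, fetch = fake_fetch)
results <- Filter(Negate(is.null), results)  # drop the failed pages
print(naive[1])
print(padded[1])
print(length(results))
```

In real use, `fetch` would be a function such as `get_table` from the code above; whether the site returns an error for every missing code is an assumption that would need checking.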
Any help with getting this working would be greatly appreciated.
Answer 0 (score: 1)
I think this is what you're looking for.
install.packages("pbapply") # has a nice addition to lapply, estimates run time
library(rvest)
library(dplyr)
library(magrittr)
library(pbapply)
## Get State urls
lwc.url <- "http://livingwage.mit.edu"
state.urls <- read_html(lwc.url)
# html_attr() is rvest's own accessor, so xml2 does not need to be attached
state.urls %<>% html_nodes(".col-md-6 a") %>% html_attr("href") %>%
  paste0(lwc.url, .)
## get county urls and county names
county.urls <- lapply(state.urls, function(x) read_html(x) %>%
    html_nodes(".col-md-3 a") %>% html_attr("href") %>%
    paste0(lwc.url, .)) %>% unlist()
## Get the tables Hourly wage & typical Expenses
dfs <- pblapply(county.urls, function(x) {
  LWC <- read_html(x)
  df <- rbind(
    LWC %>% html_nodes("table") %>% .[[1]] %>%
      html_table() %>% setNames(c("Info", names(.)[-1])),
    LWC %>% html_nodes("table") %>% .[[2]] %>%
      html_table() %>% setNames(c("Info", names(.)[-1])))
  title <- LWC %>% html_nodes("h1") %>% html_text()
  df$State <- trimws(gsub(".*,", "", title))
  df$County <- trimws(gsub(".*for (.*) County.*", "\\1", title))
  df$url <- x
  df
})
df <- data.table::rbindlist(dfs)
View(df)
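To show what the two `gsub()` calls in the code above do to a page title, here is a small sketch that runs without the website. Both title strings are hypothetical; the real `<h1>` text on the site may be worded differently.

```r
# Hypothetical titles; the actual <h1> text is an assumption.
title <- "Living Wage Calculation for Autauga County, Alabama"

# Everything up to the last comma is dropped, leaving the state.
state <- trimws(gsub(".*,", "", title))

# The capture group keeps the text between "for " and " County".
county <- trimws(gsub(".*for (.*) County.*", "\\1", title))
print(c(state = state, county = county))

# When the pattern does not match, gsub() returns the input unchanged,
# so a title without the word "County" yields the full title instead.
other <- "Living Wage Calculation for Orleans Parish, Louisiana"
other_county <- trimws(gsub(".*for (.*) County.*", "\\1", other))
print(other_county == other)
```

If some titles do not contain "County", the `County` column for those rows would hold the whole title, so it may be worth checking that column after the run.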