I have a list of 38,000+ URLs. Each URL has a table I want to scrape. For example:
library(rvest)
library(magrittr)
library(plyr)
# Doing URLs one by one
url <- "http://www.acpafl.org/ParcelResults.asp?Parcel=00024-006-000"
## GET SALES DATA
pricesdata <- read_html(url) %>% html_nodes(xpath = "//table[4]") %>% html_table(fill = TRUE)
df <- ldply(pricesdata, data.frame)  # flatten the list of tables into one data frame
I want to generalize this to all of the URLs in Parcels_ID.txt.
#(1) Step one is to generate a list of urls that we want to scrape data from
parcels <- read.csv(file = "Parcels_ID.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)  # import the data
parcelIDs <- as.vector(parcels$PARCELID)  # one parcel ID per vector element
parcels$url <- paste("http://www.acpafl.org/ParcelResults.asp?Parcel=", parcelIDs, sep = "")  # paste the base web address and each parcel ID together to get the link to that parcel on the website
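To sanity-check the generated links before looping over all 38,000 of them, I printed the first few and scraped one by hand (this assumes the first row of Parcels_ID.txt holds a valid parcel ID):

head(parcels$url, 3)  # the first URL should match the manual example above
test <- read_html(parcels$url[1]) %>% html_nodes(xpath = "//table[4]") %>% html_table(fill = TRUE)
str(test)  # should be a list containing one data frame of sales data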
Now that I have this, I want to write a loop that goes through each URL, pulls out the table, and puts the results into a list of data frames. This is where I run into trouble:
#(2) Step two is to write a for loop that will scrape the tables from the individual pages
compiled <- list()
for (i in seq_along(parcels$url)) {
  ## GET SALES DATA
  pricesdata <- read_html(parcels$url[i]) %>% html_nodes(xpath = "//table[4]") %>% html_table(fill = TRUE)
  compiled[[i]] <- ldply(pricesdata, data.frame)
}
This code never finishes. I would appreciate any eagle eyes that can spot an error or problem, or any advice on best practices for writing this kind of loop so that it pulls the tables from the website into data frames.
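I was wondering whether something like the following would be more robust: run on a small subset first to estimate timing, wrap each request in tryCatch() so one bad page does not stop the whole run, and pause between requests. This is only a sketch; the subset size of 10 and the 1-second pause are arbitrary guesses on my part:

test_urls <- head(parcels$url, 10)  # small subset first; 10 is an arbitrary choice
compiled <- vector("list", length(test_urls))
for (i in seq_along(test_urls)) {
  compiled[[i]] <- tryCatch({
    pricesdata <- read_html(test_urls[i]) %>%
      html_nodes(xpath = "//table[4]") %>%
      html_table(fill = TRUE)
    ldply(pricesdata, data.frame)
  }, error = function(e) NULL)  # keep going on errors; NULL marks a failed page
  Sys.sleep(1)  # pause between requests so the server isn't hammered
}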
Thanks