Question

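I am trying to get the average tax rate for every county in Texas by scraping tax-rates.org. I have a list of 255 counties in a csv file, which I import as "TX_counties"; it is a list. I have to build a URL string for each county, so I set d1 to the first cell using [i,1], concatenate it into a URL string, perform the scrape, and then add 1 to [i] so that it moves on to the second cell for the next county name, and so on.

The problem is that I can't figure out how to store the scrape results in a "growing list", which I then want to turn into a table and finally save to a .csv file. I can only scrape one county at a time, and each result overwrites the previous one.

Any ideas? (I'm fairly new to R and to scraping in general.)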

i <- 1
for (i in 1:255) {

  d1 <- as.character(TX_counties[i,1])

  uri.seed <- paste(c('http://www.tax-rates.org/texas/',d1,'_county_property_tax'), collapse='')

  html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)

  avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)

  t1 <- data.table(d1,avg_taxrate)

  i <- i+1

}

write.csv(t1,"2015_TX_PropertyTaxes.csv")

Answer 1

This uses rvest, provides a progress bar, and takes advantage of the fact that the URLs are already on the page:

library(rvest)
library(pbapply)

pg <- read_html("http://www.tax-rates.org/texas/property-tax")

# get all the county tax table links
ctys <- html_nodes(pg, "table.propertyTaxTable > tr > td > a[href*='county_property']")

# match your lowercased names
county_name <- tolower(gsub(" County", "", html_text(ctys)))

# spider each page and return the rate %
county_rate <- pbsapply(html_attr(ctys, "href"), function(URL) {
  cty_pg <- read_html(URL)
  html_text(html_nodes(cty_pg, xpath="//div[@class='box']/div/div[1]/i[1]"))
}, USE.NAMES=FALSE)

tax_table <- data.frame(county_name, county_rate, stringsAsFactors=FALSE)

tax_table
##   county_name              county_rate
## 1    anderson Avg. 1.24% of home value
## 2     andrews Avg. 0.88% of home value
## 3    angelina Avg. 1.35% of home value
## 4     aransas Avg. 1.29% of home value

write.csv(tax_table, "2015_TX_PropertyTaxes.csv")

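Note 1: I limited the scrape to 4 so as not to hammer the bandwidth of a site that provides free data.

Note 2: There are only 254 county tax links on that site, so if you have 255, you seem to have one extra.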
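
The code above does not show how the scrape was capped at 4. One way to do it, sketched here under assumptions, is to take the same leading subset of both the link text and the link URLs so that the names and the rates stay the same length. The vectors `link_text` and `link_href` and the name `n_max` are hypothetical stand-ins for `html_text(ctys)`, `html_attr(ctys, "href")` and the cap; the URLs are built from the pattern used in the question rather than read from the live page.

```r
# Assumed sample standing in for html_text(ctys); not fetched from the site
link_text <- c("Anderson County", "Andrews County", "Angelina County",
               "Aransas County", "Archer County", "Armstrong County")

# Assumed stand-in for html_attr(ctys, "href"), built from the question's URL pattern
link_href <- paste0("http://www.tax-rates.org/texas/",
                    tolower(sub(" County", "", link_text)),
                    "_county_property_tax")

n_max <- 4  # hypothetical cap on the number of pages to request

# Index of the entries to keep; never longer than the available links
keep <- seq_len(min(n_max, length(link_href)))

# Subset names and URLs with the same index so they stay aligned
county_name <- tolower(sub(" County", "", link_text[keep]))
county_url  <- link_href[keep]
```

Passing `county_url` to the `pbsapply()` call instead of the full set of links would then return one rate per name in `county_name`.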

Answer 2

library(RCurl)
library(XML)
tx_c <- c("anderson", "andrews")

res <- sapply(1:2, function(x){
    d1 <- as.character(tx_c[x])
    uri.seed <- paste(c('http://www.tax-rates.org/texas/',d1,'_county_property_tax'), collapse='')
    html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)
    avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)
    return(c(d1, avg_taxrate))
})

res.df <- data.frame(t(res), stringsAsFactors = FALSE)
names(res.df) <- c("county", "property")
res.df
#    county                 property
# 1 anderson Avg. 1.24% of home value
# 2  andrews Avg. 0.88% of home value

Answer 3

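You should first initialize a list to store the data scraped in each loop iteration. Make sure to initialize it before entering the loop.

Then, on each iteration, append to the list before starting the next iteration. See my answer: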

Web Scraping in R with loop from data.frame
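
A minimal sketch of that pattern follows, assuming a hypothetical helper `scrape_rate()` that returns a fixed string so the example runs without network access; in practice its body would be the XML or rvest call from the question or the other answers. The three county names are an assumed sample, not the full CSV.

```r
# Hypothetical stand-in for the real scrape; returns a fixed string so the
# sketch runs offline. Replace the body with the actual page request.
scrape_rate <- function(county) {
  paste0("rate for ", county)
}

counties <- c("anderson", "andrews", "angelina")  # assumed sample

# Initialise the list once, before entering the loop
results <- vector("list", length(counties))

for (i in seq_along(counties)) {
  # Store this iteration's row in slot i instead of overwriting one variable
  results[[i]] <- data.frame(county = counties[i],
                             avg_taxrate = scrape_rate(counties[i]),
                             stringsAsFactors = FALSE)
}

# Bind the per-county rows into one table, then write it out once
tax_table <- do.call(rbind, results)
out_file <- file.path(tempdir(), "2015_TX_PropertyTaxes.csv")
write.csv(tax_table, out_file, row.names = FALSE)
```

Note that `for` advances `i` on its own, so the manual `i <- i+1` in the question's loop is not needed. Pre-allocating the list with `vector("list", n)` is a design choice here; appending with `results[[length(results) + 1]] <- ...` to a list created with `list()` would also work.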

Creating a table via web scraping using a loop
