I am trying to search the ProQuest Archiver with R. I want to find the number of newspaper articles that contain a particular keyword. The rvest toolchain usually works well, but the script sometimes crashes. See this minimal example:
library(xml2)
library(rvest)

# Retrieve the title of the first search hit on the page of search results
for (p in seq(0, 150, 10)) {
  searchURL <- paste("http://pqasb.pqarchiver.com/djreprints/results.html?st=advanced&QryTxt=bankruptcy&sortby=CHRON&datetype=6&frommonth=01&fromday=01&fromyear=1908&tomonth=12&today=31&toyear=1908&By=&Title=&at_hist=article&at_hist=editorial_article&at_hist=front_page&type=historic&start=", p, sep="")
  htmlWeb <- read_html(searchURL)
  nodeWeb <- html_node(htmlWeb, ".text tr:nth-child(1) .result_title a")
  textWeb <- html_text(nodeWeb)
  print(textWeb)
  Sys.sleep(0.1)
}
This sometimes works for me. But if I run this or a similar script a few times, it fails at the same point, with an error on the 12th iteration (p=120):
Error in open.connection(x, "rb") : HTTP error 503.
I tried to avoid this by pausing for increasingly long intervals, but that did not help.
I have also considered other approaches.
Thanks for any comments.
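One way to make the escalating-pause idea robust is to wrap the fetch in error handling, so a transient 503 triggers a retry with exponential backoff instead of killing the loop. A minimal sketch (`fetch_with_retry` is a hypothetical helper of my own, not part of rvest):

```r
# Hypothetical helper: call `fetch` (e.g. function() read_html(searchURL)),
# retrying with exponentially growing waits whenever it throws an error
# such as "HTTP error 503". Gives up after max_tries attempts.
fetch_with_retry <- function(fetch, max_tries = 5, base_wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt == max_tries) stop(result)
    Sys.sleep(base_wait * 2^(attempt - 1))  # wait 1s, 2s, 4s, ...
  }
}
```

In the loop above, `read_html(searchURL)` would become `fetch_with_retry(function() read_html(searchURL))`.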
Answer 0 (score: 3)
Try being more human-like in your delays. This worked for me (over multiple attempts):
library(xml2)
library(httr)
library(rvest)
library(purrr)
library(dplyr)

to_get <- seq(0, 150, 10)
pb <- progress_estimated(length(to_get))

map_chr(to_get, function(i) {
  pb$tick()$print()
  searchURL <- paste("http://pqasb.pqarchiver.com/djreprints/results.html?st=advanced&QryTxt=bankruptcy&sortby=CHRON&datetype=6&frommonth=01&fromday=01&fromyear=1908&tomonth=12&today=31&toyear=1908&By=&Title=&at_hist=article&at_hist=editorial_article&at_hist=front_page&type=historic&start=", i, sep="")
  htmlWeb <- read_html(searchURL)
  nodeWeb <- html_node(htmlWeb, "td > font.result_title > a")
  textWeb <- html_text(nodeWeb)
  Sys.sleep(sample(10, 1) * 0.1)
  textWeb
}) -> titles

print(trimws(titles))
print(trimws(titles))
## [1] "NEWSPAPER SPECIALS."
## [2] "NEWSPAPER SPECIALS."
## [3] "New Jersey Ice Co. Insolvent."
## [4] "NEWSPAPER SPECIALS."
## [5] "NEWSPAPER SPECIALS"
## [6] "AMERICAN ICE BEGINNING BUSY SEASON IN IMPROVED CONDITION."
## [7] "NEWSPAPER SPECIALS"
## [8] "THE GERMAN REICHSBANK."
## [9] "U.S. Exploration Co. Bankrupt."
## [10] "CHICAGO TRACTION."
## [11] "INCREASING FREIGHT RATES."
## [12] "A.O. BROWN & CO."
## [13] "BROAD STREET GOSSIP"
## [14] "Meadows, Williams & Co."
## [15] "FAILURES IN OCTOBER."
## [16] "Supplementary Receiver for Heinze & Co."
I randomized the sleep values, simplified the CSS selector, added a progress bar, and collected the results into a vector automatically. You will ultimately want a data.frame out of this data, so ?purrr::map_df is the way to go.
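The map_df pattern can be sketched without live requests: each page becomes a one-row data.frame and map_df row-binds them. The `scrape_page` helper and its column names below are my own stand-ins, not output from the real site:

```r
library(purrr)

# Dummy stand-in for scraping one page of results; in the real script the
# title would come from read_html()/html_node()/html_text() as above.
scrape_page <- function(start) {
  data.frame(start = start,
             title = paste("title at offset", start),
             stringsAsFactors = FALSE)
}

# map_df() applies scrape_page to each offset and row-binds the pieces
# into a single data.frame.
results <- map_df(seq(0, 20, 10), scrape_page)
```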
Answer 1 (score: 1)
In the end, we used a combination of the following:
We still cannot reliably access every URL; in that case we simply record where the failure occurred and continue.
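A way to record failures and keep going, rather than aborting the whole run, is purrr::safely(), which turns errors into data. A sketch with a deliberately failing fetch (the failure condition is made up for illustration):

```r
library(purrr)

# Stand-in fetch that fails for one offset, mimicking an occasional 503.
fetch <- function(start) {
  if (start == 10) stop("HTTP error 503")
  paste("title at offset", start)
}

# safely() wraps fetch so each call returns list(result=, error=)
# instead of throwing.
safe_fetch <- safely(fetch)
out <- map(seq(0, 20, 10), safe_fetch)

# Keep the successes; note which offsets failed so they can be revisited.
titles <- compact(map(out, "result"))
failed <- seq(0, 20, 10)[map_lgl(out, ~ !is.null(.x$error))]
```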
Thanks for the comments!