Question

I am using RSelenium, docker and rvest to collect data from a website for research.

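I built a script that automatically "clicks" through the pages whose content I want to download. My problem is that the results change from run to run: the number of observations of the variable I'm interested in is not stable. The scrape covers about 50,000 observations, and when I run the script multiple times the total number of observations differs by a few hundred.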

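I think this has to do with the internet connection being too slow, or the website not loading fast enough, or something else. When I change Sys.sleep(2), the results change as well, but without a clear pattern: a higher number sometimes makes things worse and sometimes better.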
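If timing really is the cause, one alternative to a fixed pause would be to poll until the new page has actually loaded. Below is a minimal sketch of that idea. `wait_until` is a hypothetical helper (it is not part of RSelenium), and `page_changed` is a stand-in condition used only so the example runs without a browser. In the real loop the condition might compare the first scraped item against the one from the previous page.

```r
# Hypothetical helper (not part of RSelenium): poll `condition` until it
# returns TRUE or `timeout` seconds have passed. Returns TRUE or FALSE.
wait_until <- function(condition, timeout = 10, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat errors (e.g. an element that is not there yet) as "not ready"
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-in condition for demonstration: becomes TRUE on the third check.
checks <- 0
page_changed <- function() {
  checks <<- checks + 1
  checks >= 3
}

ready <- wait_until(page_changed, timeout = 5, interval = 0.1)
```

Checking the return value matters here: if `ready` is FALSE the page never changed within the timeout, and scraping it anyway would silently record a stale or partial page.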

In the R terminal, I run:

docker run -d -p 4445:4444 selenium/standalone-chrome

Then my code looks like this:

library(RSelenium)
library(rvest) # provides read_html(), html_nodes(), html_text() and %>%

remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()
remDr$navigate("url of website")
pages <- 100 # for example, I want information from the first hundred pages
variable <- vector("list", pages)
i <- 1
while (i <= pages) {
    # scrape the page that is currently loaded in the browser
    variable[[i]] <- remDr$getPageSource()[[1]] %>%
        read_html(encoding = "UTF-8") %>%
        html_nodes("node that indicates the information I want") %>% # select the information I want
        html_text()
    element_next_page <- remDr$findElement(using = "css selector", "node that indicates the 'next page' button") # select button with which I can go to the next page
    element_next_page$sendKeysToElement(list(key = "enter")) # go to the next page
    Sys.sleep(2) # I believe this is done to not overload the website I'm scraping
    i <- i + 1
}
variable <- unlist(variable)

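Somehow, running this multiple times keeps returning different results in terms of the number of observations that remain when I unlist variable.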
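To narrow down where the counts diverge, the per-page lengths of two runs could be compared before calling unlist(). The following is only a sketch: `run_a` and `run_b` are made-up stand-ins for the `variable` list from two separate runs, and `page_counts` is a hypothetical helper name.

```r
# Per-page item counts for one run; `pages_list` is the list filled by the
# loop, taken before unlist() flattens it.
page_counts <- function(pages_list) lengths(pages_list)

# Made-up stand-ins for two runs over three pages
run_a <- list(c("a", "b", "c"), c("d", "e", "f"), c("g", "h", "i"))
run_b <- list(c("a", "b", "c"), character(0), c("g", "h", "i"))

# Indices of the pages whose counts differ between the two runs
differing <- which(page_counts(run_a) != page_counts(run_b))
differing
```

Pages that come back empty or short in only some runs would point to content that had not finished loading when getPageSource() was called.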

Has anyone had the same experience, and do you have tips on what to do?

Thanks.

R web scraping a slow/overloaded (?) website

0 answers: