Question

I am trying to scrape https://www.nyse.com/bell/calendar. For some reason, the HTML I download differs from the HTML I see when I use Inspect Element in the browser. I used the following code:

SetDir <- "~/NYSE/"

setwd(SetDir)

# Directory that will hold the downloaded pages
CreateDir <- paste(SetDir, "RawData/", sep = "")

# Create it only if it does not already exist
if (!dir.exists(CreateDir)) {
  dir.create(CreateDir)
}



    url <- "https://www.nyse.com/bell/calendar"
    # Destination file inside RawData/
    urlname <- paste(CreateDir, ".html", sep = "")
    err <- try(download.file(url, destfile = urlname, quiet = FALSE), silent = TRUE)
    # Retry once after a short pause if the first attempt failed
    if (inherits(err, "try-error")) {
      Sys.sleep(5)
      try(download.file(url, destfile = urlname, quiet = FALSE), silent = TRUE)
    }

After running the commands above, I get the following warning message:

Warning message:
In download.file(url, destfile = urlname, method = "internal", mode = "w",  :
  downloaded length 18598 != reported length 200

I even tried very simple functions, including one from the RCurl package:

library(RCurl)  # provides getURL()

script <- readLines("https://www.nyse.com/bell/calendar")
script <- getURL("https://www.nyse.com/bell/calendar")

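These calls do not produce the warning above, but the HTML they return still differs from what I see when I inspect the page in the browser. For some reason, they do not seem to retrieve the HTML I am looking for. The same methods work when I try them on other websites. I am a bit lost as to what is happening here. Is this particular website protected in some way?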
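One way to narrow this down is to check whether text that is visible in Inspect Element appears anywhere in the raw download. The following is only a minimal sketch, assuming the calendar content is filled in by JavaScript after the page loads, which is not confirmed for this site. The helper `contains_marker` and the sample page in `raw_html` are hypothetical and stand in for the output of `readLines(url)`.

```r
# Hypothetical helper: TRUE if any line of the raw HTML contains the literal string
contains_marker <- function(html_lines, marker) {
  any(grepl(marker, html_lines, fixed = TRUE))
}

# Made-up stand-in for readLines(url): a page whose body is an empty container
# that a script would populate later (assumption, not the real NYSE markup)
raw_html <- c(
  "<html><body>",
  "<div id=\"calendar\"></div>",
  "<script src=\"app.js\"></script>",
  "</body></html>"
)

# The empty container is in the raw HTML
has_container <- contains_marker(raw_html, "id=\"calendar\"")

# Text that would only exist after rendering is not
has_rendered_text <- contains_marker(raw_html, "Opening Bell")

print(has_container)
print(has_rendered_text)
```

If a string copied from Inspect Element is missing from the raw lines, the difference would come from the browser rendering the page rather than from `download.file`, `readLines`, or `getURL` failing.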

Thanks.

R web scraping / crawling

0 answers: