Read URLs of CSV files into a vector

Date: 2017-11-14 02:59:05

Tags: html r web-scraping rvest

This question is similar to one I asked here. I have tried the tips from the answers there, but for some reason I am not getting any matches, and I am looking for some additional guidance.

I am trying to scrape the URLs of all downloadable CSV files from the following web page: Page with CSV Files. Here is what I have tried so far.

library(rvest)
library(dplyr)

myURL <- 'https://marketplace.spp.org/pages/rtbm-lmp-by-location#%2F2017%2F11%2FBy_Day'

attempt1 <- read_html( myURL ) %>%
  html_nodes( xpath='//*[contains(@class, "f-csv")]/..' ) %>%
  html_attr('href')

attempt2 <- read_html( myURL ) %>%
  html_nodes( xpath='//*[contains(@class, "files") and contains(@href, ".csv")]' ) %>%
  html_attr('href')

Both attempts return empty character vectors. Both XPath calls return zero-length node lists.

When I search the page for just the class, I get a non-empty list, but I cannot drill down any further.

This returns a non-empty list:

html_nodes()

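Right-clicking the download icon, I see that the URL for the 2017-11-01 file (note that the file name differs from the displayed update time) should be: https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=%2F2017%2F11%2FBy_Day%2FRTBM-LMP-DAILY-SL-20171101.csv. When I click it, the CSV downloads for me.

Any ideas on how to return the download URLs of the CSV files?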

1 answer:

Answer 0: (score: 1)

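The site makes XHR requests and builds those blocks dynamically, so no plain web scraping will work: the links are not in the HTML that read_html() receives.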
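To illustrate why both attempts in the question come back empty, here is a minimal offline sketch. The HTML string is a hypothetical stand-in for what the server sends before any JavaScript runs (the real page markup is not reproduced here); the point is only that an XPath query cannot match links that have not been rendered yet.

```r
library(rvest)

# Hypothetical stand-in for the static page: the listing container is empty
# because the file links are added client-side after an XHR call.
static_html <- '<html><body>
  <div class="files"></div>
  <script>/* file list is fetched and rendered client-side */</script>
</body></html>'

doc <- read_html(static_html)

# Look for any element whose href points at a CSV file
csv_links <- html_nodes(doc, xpath = '//*[contains(@href, ".csv")]')

length(csv_links)            # no nodes matched
html_attr(csv_links, "href") # empty character vector, as in the question
```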

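Simply capturing the XHR request and replaying it does not work either, because the site is semi-well coded and has extensive CSRF protection.

So we need to fetch a normal page from the site that contains the CSRF metadata, extract those tokens, and then make a faux XHR request to the endpoint the page itself would call.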
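The token-extraction step can be tried offline first. This is a sketch under assumptions: the HTML below is a made-up page head with placeholder token values, and `extract_token()` is a hypothetical helper name; only the two meta `id`s come from the function further down.

```r
library(rvest)

# Hypothetical page head; real token values change from request to request
sample_page <- '<html><head>
  <meta id="_csrf" content="abc123">
  <meta id="_spp_csrf" content="def456">
</head><body></body></html>'

doc <- read_html(sample_page)

# Hypothetical helper: read the content attribute of <meta id="...">
extract_token <- function(doc, id) {
  html_attr(html_node(doc, sprintf("meta[id='%s']", id)), "content")
}

tokens <- c(
  csrf     = extract_token(doc, "_csrf"),
  spp_csrf = extract_token(doc, "_spp_csrf")
)
tokens
```

If either value comes back as NA on the live page, the meta tags have probably changed and the request headers will need adjusting.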

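The site is also very uniform (another sign that it was built with some care), so it is easy to write a general function that takes a year, a month and a "type" (I made some guesses based on the information in the boxes on the site) and returns the contents of the CSV files as a list of data frames. By default it uses the current year and month, and the type defaults to "By_Day".

The CSVs take a while to download, so the function prints a message for each CSV as it is fetched. You may not need the function at all and can instead work directly with the initial XHR response to do what you need.

I tried to keep dependencies to a minimum, but purrr and dplyr would (IMO) be good additions.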

rtbm_lmp_by_location <- function(year = format(Sys.Date(), "%Y"),
                                 month = format(Sys.Date(), "%m"),
                                 by_type = c("By_Day", "By_Interval", "RePrice")) {

  require(rvest)
  require(jsonlite)
  require(httr)

  by_type <- match.arg(by_type, c("By_Day", "By_Interval", "RePrice"))

  res <- GET("https://marketplace.spp.org/pages/rtbm-lmp-by-location")
  doc <- content(res)

  x_csrf_token <- html_attr(html_node(doc, "meta[id='_csrf']"), "content")
  x_spp_csrf_token <- html_attr(html_node(doc, "meta[id='_spp_csrf']"), "content")

  POST(
    url = "https://marketplace.spp.org/file-api/", 
    add_headers(
      Host = "marketplace.spp.org", 
      Referer = "https://marketplace.spp.org/pages/rtbm-lmp-by-location",
      `X-CSRF-TOKEN` = x_csrf_token,
      `X-SPP-CSRF-TOKEN` = x_spp_csrf_token,
      `X-Requested-With` = "XMLHttpRequest"
    ), 
    body = list(
      name = "rtbm-lmp-by-location",
      fsName = "rtbm-lmp-by-location", 
      type = "folder", 
      path = sprintf("/%s/%s/%s", year, month, by_type)
    ), 
    encode = "json"
  ) -> res

  res <- content(res, as="text")
  res <- jsonlite::fromJSON(res, flatten=TRUE)
  res$path <- sprintf("https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=%s",
                      res$path)

  lapply(res$path, function(.x) {
    message(sprintf("Downloading <%s>...", .x))
    read.csv(.x, stringsAsFactors=FALSE)
  })

}

fils <- rtbm_lmp_by_location()
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171101.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171102.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171103.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171104.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171105.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171106.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171107.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171108.csv>...
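If only the download URLs are wanted, as the question asks, the listing response can be turned into a character vector without downloading anything. The sketch below works on a hypothetical JSON string: the only field known from the function above is `path`, so that is the only field used, and the real response may carry more.

```r
library(jsonlite)

# Hypothetical listing response; only the `path` field is taken from the answer
listing_json <- '[
  {"path": "/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171101.csv"},
  {"path": "/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171102.csv"}
]'

listing <- fromJSON(listing_json, flatten = TRUE)

# Same URL template as in rtbm_lmp_by_location()
csv_urls <- sprintf(
  "https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=%s",
  listing$path
)
csv_urls
```

In the real function, the same two lines would follow the `fromJSON()` call, with `csv_urls` returned instead of the `lapply()` over `read.csv()`.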

And this is the data it pulls back:

str(fils)
## List of 8
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/01/2017 00:05:00" "11/01/2017 00:05:00" "11/01/2017 00:05:00" "11/01/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/01/2017 05:05:00" "11/01/2017 05:05:00" "11/01/2017 05:05:00" "11/01/2017 05:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 16.2 18 21.3 15.9 21.4 ...
##   ..$ MLC                     : num [1:272448] 0.8342 0.3078 0.0295 0.5966 -0.1799 ...
##   ..$ MCC                     : num [1:272448] 0 2.34 5.94 0 6.19 ...
##   ..$ MEC                     : num [1:272448] 15.3 15.3 15.3 15.3 15.3 ...
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/02/2017 00:05:00" "11/02/2017 00:05:00" "11/02/2017 00:05:00" "11/02/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/02/2017 05:05:00" "11/02/2017 05:05:00" "11/02/2017 05:05:00" "11/02/2017 05:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 17.1 16.7 16.6 16.8 16.4 ...
##   ..$ MLC                     : num [1:272448] 0.5527 0.1549 0.0498 0.2766 -0.1663 ...
##   ..$ MCC                     : num [1:272448] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ MEC                     : num [1:272448] 16.6 16.6 16.6 16.6 16.6 ...
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/03/2017 00:05:00" "11/03/2017 00:05:00" "11/03/2017 00:05:00" "11/03/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/03/2017 05:05:00" "11/03/2017 05:05:00" "11/03/2017 05:05:00" "11/03/2017 05:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 18.9 18.3 17.8 18.6 17.5 ...
##   ..$ MLC                     : num [1:272448] 0.819 0.191 -0.221 0.566 -0.584 ...
##   ..$ MCC                     : num [1:272448] 0 0 0 0 0 0 0 0 -0.0076 0 ...
##   ..$ MEC                     : num [1:272448] 18.1 18.1 18.1 18.1 18.1 ...
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/04/2017 00:05:00" "11/04/2017 00:05:00" "11/04/2017 00:05:00" "11/04/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/04/2017 05:05:00" "11/04/2017 05:05:00" "11/04/2017 05:05:00" "11/04/2017 05:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 0.0107 4.4038 5.2691 0.0108 5.1795 ...
##   ..$ MLC                     : num [1:272448] 3e-04 2e-04 0e+00 4e-04 -1e-04 1e-04 -1e-04 0e+00 1e-04 3e-04 ...
##   ..$ MCC                     : num [1:272448] 0 4.39 5.26 0 5.17 ...
##   ..$ MEC                     : num [1:272448] 0.0104 0.0104 0.0104 0.0104 0.0104 0.0104 0.0104 0.0104 0.0105 0.0104 ...
##  $ :'data.frame': 283800 obs. of  8 variables:
##   ..$ Interval                : chr [1:283800] "11/05/2017 00:05:00" "11/05/2017 00:05:00" "11/05/2017 00:05:00" "11/05/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:283800] "11/05/2017 05:05:00" "11/05/2017 05:05:00" "11/05/2017 05:05:00" "11/05/2017 05:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:283800] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:283800] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:283800] 12 14.7 18.4 12.1 18.6 ...
##   ..$ MLC                     : num [1:283800] 0.4667 0.3877 0.2521 0.5528 0.0704 ...
##   ..$ MCC                     : num [1:283800] 0.0008 2.8321 6.6661 0 7.0045 ...
##   ..$ MEC                     : num [1:283800] 11.5 11.5 11.5 11.5 11.5 ...
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/06/2017 00:05:00" "11/06/2017 00:05:00" "11/06/2017 00:05:00" "11/06/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/06/2017 06:05:00" "11/06/2017 06:05:00" "11/06/2017 06:05:00" "11/06/2017 06:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 19.6 19.1 19.1 19.5 18.7 ...
##   ..$ MLC                     : num [1:272448] 0.0621 0.0153 0.0905 0.2946 -0.2689 ...
##   ..$ MCC                     : num [1:272448] 0.6728 0.2223 0.0854 0.282 0.0761 ...
##   ..$ MEC                     : num [1:272448] 18.9 18.9 18.9 18.9 18.9 ...
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/07/2017 00:05:00" "11/07/2017 00:05:00" "11/07/2017 00:05:00" "11/07/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/07/2017 06:05:00" "11/07/2017 06:05:00" "11/07/2017 06:05:00" "11/07/2017 06:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 21.5 20.2 19.6 21.1 19.2 ...
##   ..$ MLC                     : num [1:272448] 0.232 -0.277 -0.62 0.344 -0.985 ...
##   ..$ MCC                     : num [1:272448] 0 -0.819 -1.145 -0.58 -1.156 ...
##   ..$ MEC                     : num [1:272448] 21.3 21.3 21.3 21.3 21.3 ...
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 19 19 18.9 19.4 18.6 ...
##   ..$ MLC                     : num [1:272448] 0.1562 0.1086 0.0215 0.465 -0.3251 ...
##   ..$ MCC                     : num [1:272448] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ MEC                     : num [1:272448] 18.9 18.9 18.9 18.9 18.9 ...
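Since the result is a list with one data frame per file, a natural next step is to stack them and parse the timestamp strings. The sketch below uses a tiny hypothetical stand-in for `fils` with made-up values and only three of the columns; it uses base R, though `dplyr::bind_rows()` would do the stacking as well.

```r
# Hypothetical two-file stand-in for `fils`; values are illustrative only
fils_demo <- list(
  data.frame(GMT.Interval = c("11/01/2017 05:05:00", "11/01/2017 05:10:00"),
             Settlement.Location.Name = c("AEC", "AEC"),
             LMP = c(16.2, 16.4),
             stringsAsFactors = FALSE),
  data.frame(GMT.Interval = c("11/02/2017 05:05:00", "11/02/2017 05:10:00"),
             Settlement.Location.Name = c("AEC", "AEC"),
             LMP = c(17.1, 17.0),
             stringsAsFactors = FALSE)
)

# Stack the per-day frames into one data frame
combined <- do.call(rbind, fils_demo)

# Parse the month/day/year timestamps; GMT.Interval is treated as UTC
combined$GMT.Interval <- as.POSIXct(combined$GMT.Interval,
                                    format = "%m/%d/%Y %H:%M:%S",
                                    tz = "UTC")
str(combined)
```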

Having done all that, look at the URLs it retrieves:

https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171108.csv

For any particular day of the year, you could skip all of the above (^^) and just sprintf() or glue the variable path components:

rtbm_lmp_by_location_by_day <- function(date) {
  date <- as.Date(date)
  y <- format(date, "%Y")
  m <- as.numeric(format(date, "%m"))
  ymd <- format(date, "%Y%m%d")
  sprintf("https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/%s/%s/By_Day/RTBM-LMP-DAILY-SL-%s.csv",
          y, m, ymd) -> fil
  res <- httr::HEAD(fil)
  if (httr::status_code(res) != 200) {
    message("File not found")
    return(invisible(NULL))
  } else {
    message(sprintf("Downloading <%s>", fil))
    read.csv(fil, stringsAsFactors=FALSE)
  }
}

xdf <- rtbm_lmp_by_location_by_day("2017-11-08")
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171108.csv>

str(xdf)
## 'data.frame': 272448 obs. of  8 variables:
##  $ Interval                : chr  "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" ...
##  $ GMT.Interval            : chr  "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" ...
##  $ Settlement.Location.Name: chr  "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##  $ PNODE.Name              : chr  "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##  $ LMP                     : num  19 19 18.9 19.4 18.6 ...
##  $ MLC                     : num  0.1562 0.1086 0.0215 0.465 -0.3251 ...
##  $ MCC                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MEC                     : num  18.9 18.9 18.9 18.9 18.9 ...
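The same URL pattern can be generated for a whole range of days without touching the network. This sketch reuses the template and month formatting from `rtbm_lmp_by_location_by_day()`; `build_by_day_url()` is a hypothetical helper name, and whether a given file actually exists still has to be checked, for example with the `HEAD` request shown above.

```r
# Hypothetical helper: build By_Day download URLs for a vector of dates
build_by_day_url <- function(date) {
  date <- as.Date(date)
  sprintf(
    "https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/%s/%s/By_Day/RTBM-LMP-DAILY-SL-%s.csv",
    format(date, "%Y"),
    as.numeric(format(date, "%m")),  # same month handling as the function above
    format(date, "%Y%m%d")
  )
}

days <- seq(as.Date("2017-11-01"), as.Date("2017-11-03"), by = "day")
urls <- build_by_day_url(days)
urls
```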

Working out the patterns for the other categories is a straightforward process.