This question is similar to one I asked here. I have tried the tips from its answers, but for some reason I am not getting any matches, and I am looking for some additional guidance.
I am trying to scrape the URLs of all the downloadable CSV files from the following web page: Page with CSV Files. This is what I have tried so far.
library(rvest)
library(dplyr)
myURL <- 'https://marketplace.spp.org/pages/rtbm-lmp-by-location#%2F2017%2F11%2FBy_Day'
attempt1 <- read_html( myURL ) %>%
  html_nodes( xpath='//*[contains(@class, "f-csv")]/..' ) %>%
  html_attr('href')
attempt2 <- read_html( myURL ) %>%
  html_nodes( xpath='//*[contains(@class, "files") and contains(@href, ".csv")]' ) %>%
  html_attr('href')
Both attempts return empty character vectors. Both calls to html_nodes() return zero-element lists.
When I search the page for just the class, html_nodes() returns a non-empty list, but I cannot drill down any further from there.
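Right-clicking the download icon, I can see that the URL for the 2017-11-01 file (note that the file name differs from the displayed update time) should be: https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=%2F2017%2F11%2FBy_Day%2FRTBM-LMP-DAILY-SL-20171101.csv. When I click it, the CSV downloads.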
Any ideas on how to return the download URLs of the CSV files?
Answer 0 (score: 1)
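The site makes XHR requests and builds those blocks dynamically, so no plain web scraping is going to work.
Simply replaying the XHR request it makes will not work either, because the site is semi-well coded and has extensive CSRF protection.
So we need to fetch a normal page from the site that contains the CSRF metadata, extract those tokens, and then issue a fake XHR request to the endpoint the page itself calls.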
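The token-extraction step can be tried offline before touching the site. The following is a minimal sketch, assuming the tokens live in `<meta>` tags with the ids `_csrf` and `_spp_csrf` (the ids used in this answer's function); the inline HTML, the token values, and the helper name `get_meta_content` are made up for illustration.

```r
library(rvest)

# Hypothetical stand-in for the real page; the token values are invented.
page <- read_html('<html><head>
  <meta id="_csrf" content="token-a"/>
  <meta id="_spp_csrf" content="token-b"/>
</head><body></body></html>')

# Hypothetical helper: read the content attribute of a <meta> tag by id.
get_meta_content <- function(doc, id) {
  html_attr(html_node(doc, sprintf("meta[id='%s']", id)), "content")
}

# Named vector in the shape the request headers expect.
tokens <- c(
  `X-CSRF-TOKEN`     = get_meta_content(page, "_csrf"),
  `X-SPP-CSRF-TOKEN` = get_meta_content(page, "_spp_csrf")
)
print(tokens)
```

If either value comes back as `NA` against the live page, the site has probably changed its markup and the selectors need adjusting.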
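The site is also very uniform (another sign that they did not build the site/app themselves), so it is easy to write a generic function that takes a year, a month, and a "type" (I made some guesses based on the information in a box on the site) and returns the contents of the CSV files as a list of data frames. By default it uses the current year and month, and the type defaults to "By_Day".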
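The CSVs take a while to download, so the function prints a message for each CSV as it downloads it. You may not need the function at all; the initial XHR response alone may be enough for what you need.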
I tried to keep dependencies to a minimum, but purrr and dplyr would (IMO) make nice additions.
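If only the download URLs are needed, which is what the question asks for, the folder listing can be turned into URLs without reading any CSV. This is a rough sketch, assuming the listing is a JSON array whose entries carry a `path` field (the function in this answer relies on that field); the sample JSON and its `name` field are invented for illustration and are not the site's actual response.

```r
library(jsonlite)

# Invented sample of a folder listing; only the `path` field is taken from the answer.
sample_json <- '[
  {"name": "RTBM-LMP-DAILY-SL-20171101.csv",
   "path": "/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171101.csv"},
  {"name": "RTBM-LMP-DAILY-SL-20171102.csv",
   "path": "/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171102.csv"}
]'

# An array of objects parses to a data frame with one row per file.
listing <- fromJSON(sample_json, flatten = TRUE)

# Prefix each path with the download endpoint seen in the question.
base_url <- "https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path="
csv_urls <- paste0(base_url, listing$path)
print(csv_urls)
```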
rtbm_lmp_by_location <- function(year = format(Sys.Date(), "%Y"),
                                 month = format(Sys.Date(), "%m"),
                                 by_type = c("By_Day", "By_Interval", "RePrice")) {
  require(rvest)
  require(jsonlite)
  require(httr)
  by_type <- match.arg(by_type, c("By_Day", "By_Interval", "RePrice"))
  # Load a normal page first: its <meta> tags carry the CSRF tokens
  res <- GET("https://marketplace.spp.org/pages/rtbm-lmp-by-location")
  doc <- content(res)
  x_csrf_token <- html_attr(html_node(doc, "meta[id='_csrf']"), "content")
  x_spp_csrf_token <- html_attr(html_node(doc, "meta[id='_spp_csrf']"), "content")
  # Imitate the page's own XHR request for the folder listing
  POST(
    url = "https://marketplace.spp.org/file-api/",
    add_headers(
      Host = "marketplace.spp.org",
      Referer = "https://marketplace.spp.org/pages/rtbm-lmp-by-location",
      `X-CSRF-TOKEN` = x_csrf_token,
      `X-SPP-CSRF-TOKEN` = x_spp_csrf_token,
      `X-Requested-With` = "XMLHttpRequest"
    ),
    body = list(
      name = "rtbm-lmp-by-location",
      fsName = "rtbm-lmp-by-location",
      type = "folder",
      path = sprintf("/%s/%s/%s", year, month, by_type)
    ),
    encode = "json"
  ) -> res
  # Parse the JSON listing and turn each path into a download URL
  res <- content(res, as="text")
  res <- jsonlite::fromJSON(res, flatten=TRUE)
  res$path <- sprintf("https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=%s",
                      res$path)
  # Read every CSV into a data frame, announcing each download
  lapply(res$path, function(.x) {
    message(sprintf("Downloading <%s>...", .x))
    read.csv(.x, stringsAsFactors=FALSE)
  })
}
fils <- rtbm_lmp_by_location()
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171101.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171102.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171103.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171104.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171105.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171106.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171107.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171108.csv>...
And this is the data it pulled back:
str(fils)
## List of 8
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/01/2017 00:05:00" "11/01/2017 00:05:00" "11/01/2017 00:05:00" "11/01/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/01/2017 05:05:00" "11/01/2017 05:05:00" "11/01/2017 05:05:00" "11/01/2017 05:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 16.2 18 21.3 15.9 21.4 ...
## ..$ MLC : num [1:272448] 0.8342 0.3078 0.0295 0.5966 -0.1799 ...
## ..$ MCC : num [1:272448] 0 2.34 5.94 0 6.19 ...
## ..$ MEC : num [1:272448] 15.3 15.3 15.3 15.3 15.3 ...
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/02/2017 00:05:00" "11/02/2017 00:05:00" "11/02/2017 00:05:00" "11/02/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/02/2017 05:05:00" "11/02/2017 05:05:00" "11/02/2017 05:05:00" "11/02/2017 05:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 17.1 16.7 16.6 16.8 16.4 ...
## ..$ MLC : num [1:272448] 0.5527 0.1549 0.0498 0.2766 -0.1663 ...
## ..$ MCC : num [1:272448] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ MEC : num [1:272448] 16.6 16.6 16.6 16.6 16.6 ...
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/03/2017 00:05:00" "11/03/2017 00:05:00" "11/03/2017 00:05:00" "11/03/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/03/2017 05:05:00" "11/03/2017 05:05:00" "11/03/2017 05:05:00" "11/03/2017 05:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 18.9 18.3 17.8 18.6 17.5 ...
## ..$ MLC : num [1:272448] 0.819 0.191 -0.221 0.566 -0.584 ...
## ..$ MCC : num [1:272448] 0 0 0 0 0 0 0 0 -0.0076 0 ...
## ..$ MEC : num [1:272448] 18.1 18.1 18.1 18.1 18.1 ...
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/04/2017 00:05:00" "11/04/2017 00:05:00" "11/04/2017 00:05:00" "11/04/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/04/2017 05:05:00" "11/04/2017 05:05:00" "11/04/2017 05:05:00" "11/04/2017 05:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 0.0107 4.4038 5.2691 0.0108 5.1795 ...
## ..$ MLC : num [1:272448] 3e-04 2e-04 0e+00 4e-04 -1e-04 1e-04 -1e-04 0e+00 1e-04 3e-04 ...
## ..$ MCC : num [1:272448] 0 4.39 5.26 0 5.17 ...
## ..$ MEC : num [1:272448] 0.0104 0.0104 0.0104 0.0104 0.0104 0.0104 0.0104 0.0104 0.0105 0.0104 ...
## $ :'data.frame': 283800 obs. of 8 variables:
## ..$ Interval : chr [1:283800] "11/05/2017 00:05:00" "11/05/2017 00:05:00" "11/05/2017 00:05:00" "11/05/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:283800] "11/05/2017 05:05:00" "11/05/2017 05:05:00" "11/05/2017 05:05:00" "11/05/2017 05:05:00" ...
## ..$ Settlement.Location.Name: chr [1:283800] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:283800] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:283800] 12 14.7 18.4 12.1 18.6 ...
## ..$ MLC : num [1:283800] 0.4667 0.3877 0.2521 0.5528 0.0704 ...
## ..$ MCC : num [1:283800] 0.0008 2.8321 6.6661 0 7.0045 ...
## ..$ MEC : num [1:283800] 11.5 11.5 11.5 11.5 11.5 ...
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/06/2017 00:05:00" "11/06/2017 00:05:00" "11/06/2017 00:05:00" "11/06/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/06/2017 06:05:00" "11/06/2017 06:05:00" "11/06/2017 06:05:00" "11/06/2017 06:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 19.6 19.1 19.1 19.5 18.7 ...
## ..$ MLC : num [1:272448] 0.0621 0.0153 0.0905 0.2946 -0.2689 ...
## ..$ MCC : num [1:272448] 0.6728 0.2223 0.0854 0.282 0.0761 ...
## ..$ MEC : num [1:272448] 18.9 18.9 18.9 18.9 18.9 ...
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/07/2017 00:05:00" "11/07/2017 00:05:00" "11/07/2017 00:05:00" "11/07/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/07/2017 06:05:00" "11/07/2017 06:05:00" "11/07/2017 06:05:00" "11/07/2017 06:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 21.5 20.2 19.6 21.1 19.2 ...
## ..$ MLC : num [1:272448] 0.232 -0.277 -0.62 0.344 -0.985 ...
## ..$ MCC : num [1:272448] 0 -0.819 -1.145 -0.58 -1.156 ...
## ..$ MEC : num [1:272448] 21.3 21.3 21.3 21.3 21.3 ...
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 19 19 18.9 19.4 18.6 ...
## ..$ MLC : num [1:272448] 0.1562 0.1086 0.0215 0.465 -0.3251 ...
## ..$ MCC : num [1:272448] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ MEC : num [1:272448] 18.9 18.9 18.9 18.9 18.9 ...
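Since the result is a list with one data frame per day, a common next step is to stack them into a single table. Below is a minimal sketch with dplyr, using two tiny stand-in data frames instead of the real `fils`; the column names come from the `str()` output above, while `fils_demo`, `all_days`, and `file_index` are hypothetical names and the rows are truncated, rounded illustrations.

```r
library(dplyr)

# Tiny stand-ins for two elements of `fils`; values are illustrative only.
fils_demo <- list(
  data.frame(GMT.Interval = "11/01/2017 05:05:00",
             Settlement.Location.Name = "AEC",
             LMP = 16.2,
             stringsAsFactors = FALSE),
  data.frame(GMT.Interval = "11/02/2017 05:05:00",
             Settlement.Location.Name = "AEC",
             LMP = 17.1,
             stringsAsFactors = FALSE)
)

# Stack the per-day frames; `file_index` records which list element a row came from.
all_days <- bind_rows(fils_demo, .id = "file_index")

# GMT.Interval is a character column in month/day/year order; parse it as UTC.
all_days$GMT.Interval <- as.POSIXct(all_days$GMT.Interval,
                                    format = "%m/%d/%Y %H:%M:%S",
                                    tz = "UTC")
str(all_days)
```

With the real data this yields a frame of roughly two million rows for eight days, so it may be worth filtering each day before binding.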
Having done all that, look at the URLs it retrieves:
https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171108.csv
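For any particular day of the year, you do not even need to bother with all of the above: just use sprintf() or glue to fill in the variable path components: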
rtbm_lmp_by_location_by_day <- function(date) {
  date <- as.Date(date)
  y <- format(date, "%Y")
  m <- as.numeric(format(date, "%m"))
  ymd <- format(date, "%Y%m%d")
  sprintf("https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/%s/%s/By_Day/RTBM-LMP-DAILY-SL-%s.csv",
          y, m, ymd) -> fil
  res <- httr::HEAD(fil)
  if (httr::status_code(res) != 200) {
    message("File not found")
    return(invisible(NULL))
  } else {
    message(sprintf("Downloading <%s>", fil))
    read.csv(fil, stringsAsFactors=FALSE)
  }
}
xdf <- rtbm_lmp_by_location_by_day("2017-11-08")
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171108.csv>
str(xdf)
## 'data.frame': 272448 obs. of 8 variables:
## $ Interval : chr "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" ...
## $ GMT.Interval : chr "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" ...
## $ Settlement.Location.Name: chr "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## $ PNODE.Name : chr "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## $ LMP : num 19 19 18.9 19.4 18.6 ...
## $ MLC : num 0.1562 0.1086 0.0215 0.465 -0.3251 ...
## $ MCC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ MEC : num 18.9 18.9 18.9 18.9 18.9 ...
It is a straightforward process to figure out the pattern for the other categories.
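For the question's actual goal, a list of download URLs, the By_Day pattern can also be generated for a whole range of dates without any network call. This is only a sketch: `spp_by_day_urls` is a hypothetical helper, and it assumes the month folder is two digits (as `format(Sys.Date(), "%m")` produces in the first function), which is only confirmed here for November 2017.

```r
# Hypothetical helper: build By_Day download URLs for every date in a range.
spp_by_day_urls <- function(from, to) {
  dates <- seq(as.Date(from), as.Date(to), by = "day")
  # sprintf() is vectorised, so this returns one URL per date.
  # Assumption: the month folder is zero-padded ("%m"); adjust if the site differs.
  sprintf(
    "https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/%s/%s/By_Day/RTBM-LMP-DAILY-SL-%s.csv",
    format(dates, "%Y"), format(dates, "%m"), format(dates, "%Y%m%d")
  )
}

urls <- spp_by_day_urls("2017-11-06", "2017-11-08")
print(urls)
```

Each URL can then be checked with `httr::HEAD()` before downloading, as the single-day function does.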