Web scraping with R when the page requires scrolling down

Time: 2018-11-21 16:12:54

Tags: r web-scraping

I would like to download the first two columns ("GAS DAY START ON" and "GAS IN STORAGE") from the following URL:

https://agsi.gie.eu/#/historical/eu

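The default period is set to "last month", but I need "all".

Can someone tell me which package I could use for this kind of task? There is also a free API, but I could not get that to work either.

Thanks for any input, and many thanks in advance!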

1 Answer:

Answer 0: (score: 1)

Let's steer you toward the API route. If you have an API key, you can (but should not) pass it directly to the function below. Instead, put it in your ~/.Renviron as:

AGSI_KEY=thekeytheygaveyou

and restart your R session. The key will then be picked up automatically.

The following function takes start/end dates:
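To confirm that the key was picked up after the restart, a quick check along these lines may help. This is only a minimal sketch: `has_agsi_key()` is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: TRUE if the AGSI_KEY environment variable is non-empty.
has_agsi_key <- function() {
  nzchar(Sys.getenv("AGSI_KEY"))
}

if (!has_agsi_key()) {
  message("AGSI_KEY is not set; add it to ~/.Renviron and restart R.")
}
```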

get_agsi_data <- function(start, end, agsi_api_key = Sys.getenv("AGSI_KEY")) {

  # Normalize the dates to "YYYY-MM-DD" strings (only the first element is used).
  # NOTE: as written, start/end are not sent with the request below.
  start <- as.character(as.Date(start[1]))
  end <- as.character(as.Date(end[1]))

  httr::GET(
    url = "https://agsi.gie.eu/api/data/eu", # NOTE THE HARDCODING FOR eu
    httr::add_headers(`x-key` = agsi_api_key),
    httr::user_agent("user@example.com") # REPLACE THIS WITH YOUR EMAIL ADDRESS
  ) -> res

  httr::stop_for_status(res) # stops with an error if the API returns an HTTP error status

  out <- httr::content(res, as = "text", encoding = "UTF-8")

  out <- jsonlite::fromJSON(out)

  sapply(out$info, function(x) { # the info element is an ugly list, so collapse each entry to one string
    if (length(x)) {
      paste0(x, collapse = "; ")
    } else {
      NA_character_
    }
  }) -> info

  out$info <- info

  # the column helpers are namespaced so that library(readr) is not required
  readr::type_convert(
    df = out,
    col_types = readr::cols(
      status = readr::col_character(),
      gasDayStartedOn = readr::col_date(format = ""),
      gasInStorage = readr::col_double(),
      full = readr::col_double(),
      trend = readr::col_double(),
      injection = readr::col_double(),
      withdrawal = readr::col_double(),
      workingGasVolume = readr::col_double(),
      injectionCapacity = readr::col_double(),
      withdrawalCapacity = readr::col_double()
    )
  ) -> out

  # print as a tibble (needs the tibble package installed)
  class(out) <- c("tbl_df", "tbl", "data.frame")

  out

}

xdf <- get_agsi_data("2018-06-01", "2018-10-01")

xdf
## # A tibble: 2,880 x 11
##    status gasDayStartedOn gasInStorage  full trend injection withdrawal workingGasVolume injectionCapacity
##  * <chr>  <date>                 <dbl> <dbl> <dbl>     <dbl>      <dbl>            <dbl>             <dbl>
##  1 E      2018-11-19              918.  86.1 -0.41      343.      4762.            1067.            11469.
##  2 E      2018-11-18              923.  86.5 -0.22      534.      2841.            1067.            11469.
##  3 E      2018-11-17              925.  86.7 -0.2       649.      2796.            1067.            11469.
##  4 E      2018-11-16              927.  86.9 -0.24      492.      3014.            1067.            11469.
##  5 E      2018-11-15              930.  87.1 -0.16      503.      2210.            1067.            11469.
##  6 E      2018-11-14              931.  87.3 -0.1       605.      1682.            1067.            11469.
##  7 E      2018-11-13              933.  87.4 -0.07      651.      1438.            1067.            11469.
##  8 E      2018-11-12              933.  87.5 -0.05      833.      1391.            1067.            11468.
##  9 E      2018-11-11              934.  87.5  0.09     1607.       659.            1067.            11478.
## 10 E      2018-11-10              933.  87.4  0.06     1458.       796.            1067.            11478.
## # ... with 2,870 more rows, and 2 more variables: withdrawalCapacity <dbl>, info <chr>
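Since the question only asks for the first two columns, the result can be trimmed afterwards. The following is a minimal sketch, assuming the column names shown in the output above. `storage_columns()` is a hypothetical helper, and the date filter is applied locally because the sample output contains rows outside the requested range.

```r
# Hypothetical helper: keep the two columns from the question and,
# optionally, only the rows whose gas day falls within [start, end].
storage_columns <- function(df, start = NULL, end = NULL) {
  res <- df[, c("gasDayStartedOn", "gasInStorage"), drop = FALSE]
  if (!is.null(start)) {
    res <- res[res$gasDayStartedOn >= as.Date(start), , drop = FALSE]
  }
  if (!is.null(end)) {
    res <- res[res$gasDayStartedOn <= as.Date(end), , drop = FALSE]
  }
  res
}

# Toy data standing in for the API result (rounded values from the output above)
toy <- data.frame(
  gasDayStartedOn = as.Date(c("2018-11-19", "2018-11-18", "2018-11-17")),
  gasInStorage = c(918, 923, 925),
  full = c(86.1, 86.5, 86.7)
)

storage_columns(toy, start = "2018-11-18")
```

On the real result this would be called as storage_columns(xdf, "2018-06-01", "2018-10-01").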

eu is hardcoded, but the function is easy to extend for the other API endpoints.
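One way to lift the hardcoding is to build the URL from an argument. This is a rough sketch, not the answer's own code: `agsi_url()` is a hypothetical helper, only the "eu" code appears in the answer, and any other endpoint code is an assumption that should be checked against the API documentation.

```r
# Hypothetical helper: build the request URL for a given endpoint code.
# Only "eu" is confirmed by the answer; other codes are assumptions.
agsi_url <- function(endpoint = "eu") {
  stopifnot(is.character(endpoint), length(endpoint) == 1, nzchar(endpoint))
  sprintf(
    "https://agsi.gie.eu/api/data/%s",
    utils::URLencode(endpoint, reserved = TRUE) # escape unsafe characters
  )
}

agsi_url()
```

Inside get_agsi_data(), an added endpoint argument could then replace the hardcoded string with url = agsi_url(endpoint).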
