Question

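I'm new to HTML, but I'm playing with a script that downloads all PDF files linked from a given web page (for fun, and to avoid boring manual work). I can't figure out where in the HTML document I should look for the data needed to complete relative paths. I know it must be possible, since my web browser can do it.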

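Example: I'm trying to use the R package rvest to scrape the lecture notes linked from this page from ocw.mit.edu. Whether I look at the raw HTML or access the href attribute of the a nodes, I only get relative paths: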

library(rvest)
url <- paste0("https://ocw.mit.edu/courses/",
  "electrical-engineering-and-computer-science/",
  "6-006-introduction-to-algorithms-fall-2011/lecture-notes/")

# Read webpage and extract all links
links_all <- read_html(url)  %>% 
  html_nodes("a") %>%
  html_attr("href")

# Extract only href ending in "pdf"
links_pdf <- grep("pdf$", tolower(links_all), value = TRUE)
links_pdf[1] 
[1] "/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/mit6_006f11_lec01.pdf"

Answer 1

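The simplest solution I've found so far is the url_absolute(x, base) function from the xml2 package. For the base argument, use the URL of the page you retrieved the source from.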

This seems less error-prone than trying to extract the base URL from the address with a regexp.
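
As a minimal sketch of that approach, the snippet below resolves the relative href shown in the question against the page URL. The variable names base_url, rel_link and abs_link are made up for illustration; only url_absolute() comes from xml2.

```r
library(xml2)

# URL of the page the links were scraped from (same as in the question)
base_url <- paste0("https://ocw.mit.edu/courses/",
  "electrical-engineering-and-computer-science/",
  "6-006-introduction-to-algorithms-fall-2011/lecture-notes/")

# Root-relative href as returned by html_attr("href") in the question
rel_link <- paste0("/courses/electrical-engineering-and-computer-science/",
  "6-006-introduction-to-algorithms-fall-2011/lecture-videos/",
  "mit6_006f11_lec01.pdf")

# Resolve the relative path against the page URL;
# x may also be a character vector of several links
abs_link <- url_absolute(rel_link, base_url)
print(abs_link)
```

Because the href starts with "/", it is resolved against the scheme and host of the base URL, so the result should be the href prefixed with https://ocw.mit.edu. In the question's script, the same call could presumably be applied to the whole links_pdf vector before downloading.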
