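Is there a way in rvest to search a given website for pages that contain a keyword?
For example, given the URL http://umich.edu/, can I get back a list of the pages that contain the word "faculty"?
I am fairly new to both rvest and web scraping, so I am not sure how to approach this kind of problem. Thank you very much!
Edit: I am looking for links to pages that contain the word "faculty".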
Answer 0 (score: 0)
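Try writing one function that collects all the links on the main page and another that checks whether each linked page contains the word "faculty". The simple script below is a reasonable starting point: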
library(rvest)
library(curl)
# Get all the links on the main page (except for social media sites)
get_all_links <- function(URL) {
  links <- URL %>% read_html() %>%
    html_nodes(xpath = "//a") %>%
    html_attr("href")
  # Drop anchors without an href, bare "/" links, and links containing "#"
  links <- links[!(is.na(links) | links == "/" | grepl("#", links))]
  # Remove social media sites (LinkedIn, Facebook, Twitter, etc.)
  social_media <- "facebook|linkedin|twitter|instagram|pinterest|youtube"
  links <- links[!grepl(social_media, links, ignore.case = TRUE)]
  # Return absolute URLs (relative links are resolved against the main URL)
  xml2::url_absolute(links, URL)
}
# Check whether a link's page contains the word "faculty"
is_faculty_link <- function(URL) {
  tryCatch({
    # Use a user agent to identify your crawler
    con <- curl(URL, handle = new_handle("useragent" = "Mozilla/5.0"))
    # Get the text content of the page
    page_content <- con %>% read_html() %>%
      html_text()
    # Case-insensitive check for the word "faculty"
    grepl("faculty", page_content, ignore.case = TRUE)
  }, error = function(e) FALSE)  # treat unreachable or non-HTML links as non-matches
}
michigan_url <- "http://umich.edu"
michigan_links <- get_all_links(michigan_url)
# Logical vector: TRUE where the linked page mentions "faculty"
has_faculty <- vapply(michigan_links, is_faculty_link, logical(1))
faculty_links <- michigan_links[has_faculty]
faculty_links
[1] "https://email.med.umich.edu"
[2] "http://mirlyn.lib.umich.edu"
[3] "http://campusinfo.umich.edu/mapsAndDirections"
[4] "http://umich.edu/schools-colleges/"
[5] "https://wolverineaccess.umich.edu"
[6] "http://umich.edu/prospective-students/"
[7] "http://umich.edu/current-students/"
[8] "http://umich.edu/faculty-staff/"
[9] "http://umich.edu/parents/"
[10] "http://umich.edu/alumni/"
[11] "http://umich.edu/about/"
[12] "http://umich.edu/academics/"
[13] "http://umich.edu/life-at-michigan/"
[14] "http://umich.edu/athletics/"
[15] "http://umich.edu/research/"
[16] "http://umich.edu/health-medicine/"
[17] "http://umich.edu/initiatives/"
[18] "http://umich.edu/giving/"
[19] "http://diversity.umich.edu/?features=see-the-video-recap-of-u-ms-historic-diversity-equity-and-inclusion-launch"
[20] "http://www.engin.umich.edu/college/about/news/stories/2016/september/wringing-power-from-water"
[21] "http://umich.edu/schools-colleges/"
[22] "http://admissions.umich.edu/academics-majors/majors-degrees"
[23] "http://record.umich.edu/articles/us-news-releases-its-latest-graduate-program-rankings"
[24] "http://global.umich.edu/"
[25] "http://record.umich.edu/articles/special-alumni-award-helps-highlight-u-ms-past-herald-its-future"
[26] "https://lsa.umich.edu/lsa/news-events/all-news/search-news/starslaughter.html"
[27] "http://umdearborn.edu/"
[28] "http://umflint.edu/"
[29] "http://umich.edu/contact/"
[30] "http://obp.umich.edu/root/budget/state-required-reports/"
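If checking the full page text turns out to be too broad (for instance, if the keyword also shows up in a menu shared by many pages), one alternative is to match the keyword against each link's URL and anchor text instead. The following is only a rough sketch of that idea: `links_matching_keyword` is a hypothetical helper, not part of rvest, and the small inline HTML document is made up so the example runs without network access.

```r
library(rvest)

# Hypothetical helper (not part of rvest): keep the links whose URL or
# anchor text mentions the keyword, returned as absolute URLs.
links_matching_keyword <- function(page, keyword, base_url) {
  anchors <- html_nodes(page, "a")
  hrefs   <- html_attr(anchors, "href")
  labels  <- html_text(anchors)
  # Skip anchors without an href, then match case-insensitively
  keep <- !is.na(hrefs) &
    (grepl(keyword, hrefs, ignore.case = TRUE) |
       grepl(keyword, labels, ignore.case = TRUE))
  unique(xml2::url_absolute(hrefs[keep], base_url))
}

# Made-up inline document standing in for a real page (assumption)
demo_page <- read_html('<html><body>
  <a href="/faculty-staff/">Faculty &amp; Staff</a>
  <a href="/about/">About</a>
  <a href="http://example.org/people">Our faculty</a>
  <a>No href here</a>
</body></html>')

demo_links <- links_matching_keyword(demo_page, "faculty", "http://umich.edu")
print(demo_links)
```

For a live site, the same helper could be called as `links_matching_keyword(read_html(michigan_url), "faculty", michigan_url)`. Note that the keyword is treated as a regular expression by `grepl`, so special characters in it would need escaping.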