我知道有免费的网站内容监视程序,可以在网站内容更改时发送电子邮件警报,但是R中是否有可以执行此操作的软件包(或任何硬编码方式)?将其集成到一个工作流程中将很有帮助。
答案 0 :(得分:1)
R是一种通用编程语言,因此您可以使用它进行任何操作。
您想做的核心成语是:
n
时间段过去了(您需要弄清楚这一点:cron?启动了吗?Amazon lambda?)对于内容,您可能不知道httr::GET()
返回了一个充满元数据的丰富,复杂的数据对象。我没有在下面做str(res)
来鼓励您自己这样做。
library(httr)
library(rvest)
library(splashr)
library(hgr) # devtools::install_github("hrbrmstr/hgr")
library(tlsh) # devtools::install_github("hrbrmstr/tlsh")
target_url <- "https://www.whitehouse.gov/briefings-statements/"
像浏览器一样获取它
httr::GET(
url = target_url,
httr::user_agent(splashr::ua_macos_safari)
) -> res
缓存页面大小,并使用实质性差异来通知通知
(page_size <- res$headers['content-length'])
## $`content-length`
## [1] "12783"
使用tlsh_simple_diff()
计算并缓存本地敏感哈希值,以查看是否存在“实质性”哈希变化并将其用作通知的信号:
doc_text <- httr::content(res, as = "text")
(doc_hash <- tlsh_simple_hash(doc_text))
## [1] "563386E33C44683E060B739261ADF20CB2D38563EE151C88A3F95169999FF97A1F385D"
该网站使用结构化的<div>
这样的缓存,并使用更多/更少/不同的来发出通知:
doc <- httr::content(res)
news_items <- html_nodes(doc, "div.briefing-statement__content")
(total_news_items <- length(news_items))
## [1] 10
(headlines <- gsub("[[:space:]]+", " ", html_text(news_items, trim=TRUE)))
## [1] "News Clips CNBC: “Job Openings Hit Record 7.136 Million in August” Economy & Jobs Oct 16, 2018"
## [2] "Fact Sheets Congressional Democrats Want to Take Away Your Doctor, Outlaw Your Private Insurance, and Put Bureaucrats In Charge of Your Healthcare Healthcare Oct 16, 2018"
## [3] "Remarks Remarks by President Trump in Briefing on Hurricane Michael Land & Agriculture Oct 15, 2018"
## [4] "Remarks Remarks by President Trump and Governor Scott at FEMA Aid Distribution Center | Lynn Haven, FL Land & Agriculture Oct 15, 2018"
## [5] "Remarks Remarks by President Trump During Tour of Lynn Haven Community | Lynn Haven, FL Land & Agriculture Oct 15, 2018"
## [6] "Remarks Remarks by President Trump and Governor Scott Upon Arrival in Florida Land & Agriculture Oct 15, 2018"
## [7] "Remarks Remarks by President Trump Before Marine One Departure Foreign Policy Oct 15, 2018"
## [8] "Statements & Releases White House Appoints 2018-2019 Class of White House Fellows Oct 15, 2018"
## [9] "Statements & Releases President Donald J. Trump Approves Georgia Disaster Declaration Land & Agriculture Oct 14, 2018"
## [10] "Statements & Releases President Donald J. Trump Amends Florida Disaster Declaration Land & Agriculture Oct 14, 2018"
使用“可读性”工具将内容转换为纯文本缓存,并与众多“文本差异/字符串差异” R软件包之一进行比较:
content_meta <- hgr::just_the_facts(target_url)
str(content_meta)
## List of 11
## $ title : chr "Briefings & Statements"
## $ content : chr "<p class=\"body-overflow\"> <header class=\"header\"> </header>\n<main id=\"main-content\"> <div class=\"page-r"| __truncated__
## $ lead_image_url: chr "https://www.whitehouse.gov/wp-content/uploads/2017/12/wh.gov-share-img_03-1024x538.png"
## $ next_page_url : chr "https://www.whitehouse.gov/briefings-statements/page/2"
## $ url : chr "https://www.whitehouse.gov/briefings-statements/"
## $ domain : chr "www.whitehouse.gov"
## $ excerpt : chr "Get official White House briefings, statements, and remarks from President Donald J. Trump and members of his Administration."
## $ word_count : int 22
## $ direction : chr "ltr"
## $ total_pages : int 2
## $ pages_rendered: int 2
## - attr(*, "row.names")= int 1
## - attr(*, "class")= chr "hgr"
不幸的是,您问了一个通用计算式的问题,因此很可能会被关闭。