Are there any website content monitoring packages in R?

Time: 2018-10-16 16:42:30

Tags: r email url

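I know there are free website content monitoring programs that send an email alert when a site's content changes, but is there a package in R (or any hard-coded way) that can do this? Integrating it into a single workflow would be helpful.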

1 answer:

Answer 0 (score: 1)

R is a general-purpose programming language, so you can do anything with it.

The core idiom for what you want to do is:

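  • Identify the target site
  • Fetch the content and the content metadata
  • Cache ^^ (you need to figure this out: RDBMS table? NoSQL table? File?)
  • n time periods pass (you need to figure this out: cron? launchd? Amazon Lambda?)
  • Fetch the content and the content metadata again
  • Compare ^^ with the cached version (note: this works best if you know the structure of the target site rather than relying on an overly generic framework)
  • If the difference is "substantial", notify yourself by whatever means you like (you need to figure this out: email, SMS, or Twitter?)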
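
The steps above can be sketched in base R as a single function. This is a minimal sketch under stated assumptions, not a definitive implementation: `check_for_change()`, `fetch_fn`, `notify_fn`, and the RDS cache file are hypothetical names and choices made for illustration, and the "difference" test is plain string inequality, which you would likely replace with something site-specific.

```r
# Hypothetical helper: one monitoring pass. All names here are illustrative.
# fetch_fn  : function returning the current content as a character string
# cache_file: path to an RDS file holding the previously seen content
# notify_fn : function called with a message when the content has changed
check_for_change <- function(fetch_fn, cache_file, notify_fn = message) {
  current <- fetch_fn()

  # First run: nothing to compare against, so just seed the cache.
  if (!file.exists(cache_file)) {
    saveRDS(current, cache_file)
    return(invisible(FALSE))
  }

  previous <- readRDS(cache_file)
  changed <- !identical(previous, current)

  if (changed) {
    notify_fn("Content changed")
    saveRDS(current, cache_file)  # the new version becomes the baseline
  }

  invisible(changed)
}
```

In practice `fetch_fn` might wrap `httr::GET()` plus `httr::content(res, as = "text")`, and a scheduler such as cron would run the function every n minutes.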

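As for the content, you may not know that httr::GET() returns a rich, complex data object full of metadata. I deliberately did not run str(res) below, to encourage you to do so yourself.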

library(httr)
library(rvest)
library(splashr)
library(hgr) # devtools::install_github("hrbrmstr/hgr")
library(tlsh) # devtools::install_github("hrbrmstr/tlsh")

target_url <- "https://www.whitehouse.gov/briefings-statements/"

Fetch it the way a browser would:

httr::GET(
  url = target_url,
  httr::user_agent(splashr::ua_macos_safari)
) -> res

Cache the page size and treat a substantial difference as a signal to notify:

(page_size <- res$headers['content-length'])
## $`content-length`
## [1] "12783"
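
A size check like this could be sketched as follows. `size_changed()` and its 5% default tolerance are assumptions made for illustration, not part of httr. Note also that not every response carries a `content-length` header, so a real check would need a fallback.

```r
# Hypothetical helper: TRUE when the page size moved by more than `tol`
# (a fraction of the old size). The 5% default is an arbitrary assumption.
size_changed <- function(old_size, new_size, tol = 0.05) {
  old_size <- as.numeric(old_size)  # the header value arrives as a string
  new_size <- as.numeric(new_size)
  abs(new_size - old_size) / old_size > tol
}

size_changed("12783", "12790")  # FALSE: a 7-byte drift is well under 5%
size_changed("12783", "15000")  # TRUE: roughly a 17% jump
```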

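Compute and cache a locality-sensitive hash, then use tlsh_simple_diff() to see whether the hash has changed "substantially" and use that as a signal to notify: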

doc_text <- httr::content(res, as = "text")

(doc_hash <- tlsh_simple_hash(doc_text))
## [1] "563386E33C44683E060B739261ADF20CB2D38563EE151C88A3F95169999FF97A1F385D"

This site uses structured <div>s, so cache those and use more/fewer/different ones as the signal to notify:

doc <- httr::content(res)

news_items <- html_nodes(doc, "div.briefing-statement__content")

(total_news_items <- length(news_items))
## [1] 10

(headlines <- gsub("[[:space:]]+", " ", html_text(news_items, trim=TRUE)))
##  [1] "News Clips CNBC: “Job Openings Hit Record 7.136 Million in August” Economy & Jobs Oct 16, 2018"                                                                            
##  [2] "Fact Sheets Congressional Democrats Want to Take Away Your Doctor, Outlaw Your Private Insurance, and Put Bureaucrats In Charge of Your Healthcare Healthcare Oct 16, 2018"
##  [3] "Remarks Remarks by President Trump in Briefing on Hurricane Michael Land & Agriculture Oct 15, 2018"                                                                       
##  [4] "Remarks Remarks by President Trump and Governor Scott at FEMA Aid Distribution Center | Lynn Haven, FL Land & Agriculture Oct 15, 2018"                                    
##  [5] "Remarks Remarks by President Trump During Tour of Lynn Haven Community | Lynn Haven, FL Land & Agriculture Oct 15, 2018"                                                   
##  [6] "Remarks Remarks by President Trump and Governor Scott Upon Arrival in Florida Land & Agriculture Oct 15, 2018"                                                             
##  [7] "Remarks Remarks by President Trump Before Marine One Departure Foreign Policy Oct 15, 2018"                                                                                
##  [8] "Statements & Releases White House Appoints 2018-2019 Class of White House Fellows Oct 15, 2018"                                                                            
##  [9] "Statements & Releases President Donald J. Trump Approves Georgia Disaster Declaration Land & Agriculture Oct 14, 2018"                                                     
## [10] "Statements & Releases President Donald J. Trump Amends Florida Disaster Declaration Land & Agriculture Oct 14, 2018"      
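
Comparing a cached vector of headlines with a freshly scraped one needs nothing beyond base R set operations. A minimal sketch, in which `diff_headlines()` and the sample headlines are hypothetical:

```r
# Hypothetical helper: compare a cached headline vector with a fresh one.
diff_headlines <- function(cached, current) {
  list(
    added   = setdiff(current, cached),  # present only in the new fetch
    removed = setdiff(cached, current)   # dropped off the page
  )
}

# Made-up sample data standing in for two scrapes of the same page.
cached  <- c("Remarks A", "Statement B", "Fact Sheet C")
current <- c("News Clip D", "Remarks A", "Statement B")

changes <- diff_headlines(cached, current)
changes$added    # "News Clip D"
changes$removed  # "Fact Sheet C"
```

A non-empty `added` element is probably the most useful trigger on a news-style page, since old items routinely scroll off the first page.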

Use a "readability" tool to turn the content into plain text, cache that, and compare it using one of the many "text diff/string diff" R packages:

content_meta <- hgr::just_the_facts(target_url)

str(content_meta)
## List of 11
##  $ title         : chr "Briefings & Statements"
##  $ content       : chr "<p class=\"body-overflow\"> <header class=\"header\"> </header>\n<main id=\"main-content\"> <div class=\"page-r"| __truncated__
##  $ lead_image_url: chr "https://www.whitehouse.gov/wp-content/uploads/2017/12/wh.gov-share-img_03-1024x538.png"
##  $ next_page_url : chr "https://www.whitehouse.gov/briefings-statements/page/2"
##  $ url           : chr "https://www.whitehouse.gov/briefings-statements/"
##  $ domain        : chr "www.whitehouse.gov"
##  $ excerpt       : chr "Get official White House briefings, statements, and remarks from President Donald J. Trump and members of his Administration."
##  $ word_count    : int 22
##  $ direction     : chr "ltr"
##  $ total_pages   : int 2
##  $ pages_rendered: int 2
##  - attr(*, "row.names")= int 1
##  - attr(*, "class")= chr "hgr"
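
As a stand-in for a dedicated diff package, base R's `utils::adist()` (generalized Levenshtein distance) can give a rough measure of how much a piece of text changed. This is only a sketch: `text_change_ratio()` is a hypothetical helper, and edit distance gets slow on long documents, so it is better suited to short fields such as the excerpt than to a whole page.

```r
# Hypothetical helper: edit distance scaled to the 0..1 range by the
# length of the longer string (0 = identical, 1 = entirely different).
text_change_ratio <- function(old_text, new_text) {
  longest <- max(nchar(old_text), nchar(new_text))
  if (longest == 0) return(0)  # two empty strings are identical
  as.numeric(utils::adist(old_text, new_text)) / longest
}

text_change_ratio("kitten", "sitting")  # 3 edits / 7 characters, about 0.43
```

What counts as a "substantial" ratio is a judgment call that depends on the site.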

Unfortunately, you have asked a general-computing kind of question, so it is likely to be closed.