Scraping browser-rendered data into R

Asked: 2016-09-29 10:12:47

Tags: r scrape bubble-chart

I want to understand what kinds of machine learning applications the US federal government is developing. The federal government maintains FedBizOpps, a website listing contract opportunities. The site can be searched with a phrase, e.g. "machine learning", and a date range, e.g. "last 365 days", to find relevant contracts. The resulting search produces links containing contract summaries.

I'd like to be able to extract those contract summaries from the site, given a search term and a date range.

Is there a way to scrape browser-rendered data into R? A similar question about web scraping exists, but I don't see how to change the date range there.

Once the information is in R, I'd like to turn the summaries into a bubble chart of key phrases.

1 Answer:

Answer 0: (score: 3)

This might look like a site that uses XHR via javascript to retrieve URL content, but it isn't. It's just a plain site that can be worked with via the standard rvest & xml2 calls html_session and read_html. It keeps the Location: URL the same, so it looks like XHR even though it isn't.
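
For example (a minimal sketch, not part of the original answer), the search landing page parses with a plain read_html(), no javascript engine required:

library(xml2)
library(rvest)

# Plain read_html() is enough to fetch and parse the page.
pg <- read_html("https://www.fbo.gov/index?s=opportunity&mode=list&tab=list")
html_text(html_node(pg, "title"))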

However, it is a <form>-based site, which means you could be generous to the community, write an R wrapper for this "hidden" API, and perhaps donate it to rOpenSci.

To do that, I used the curlconverter package on the "Copy as cURL" content of the POST request, which provides all the form fields (they seem to map to most, if not all, of the fields on the advanced search page):

library(curlconverter)

# straighten() parses the copied cURL command (read from the clipboard by
# default) and make_req() turns each request into a callable httr function.
make_req(straighten())[[1]] -> req

httr::VERB(verb = "POST", url = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list", 
    httr::add_headers(Pragma = "no-cache", 
        Origin = "https://www.fbo.gov", 
        `Accept-Encoding` = "gzip, deflate, br", 
        `Accept-Language` = "en-US,en;q=0.8", 
        `Upgrade-Insecure-Requests` = "1", 
        `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.41 Safari/537.36", 
        Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
        `Cache-Control` = "no-cache", 
        Referer = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list", 
        Connection = "keep-alive", 
        DNT = "1"), httr::set_cookies(PHPSESSID = "32efd3be67d43758adcc891c6f6814c4", 
        sympcsm_cookies_enabled = "1", 
        BALANCEID = "balancer.172.16.121.7"), 
    body = list(`dnf_class_values[procurement_notice][keywords]` = "machine+learning", 
        `dnf_class_values[procurement_notice][_posted_date]` = "365", 
        search_filters = "search", 
        `_____dummy` = "dnf_", 
        so_form_prefix = "dnf_", 
        dnf_opt_action = "search", 
        dnf_opt_template = "VVY2VDwtojnPpnGoobtUdzXxVYcDLoQW1MDkvvEnorFrm5k54q2OU09aaqzsSe6m", 
        dnf_opt_template_dir = "Pje8OihulaLVPaQ+C+xSxrG6WrxuiBuGRpBBjyvqt1KAkN/anUTlMWIUZ8ga9kY+", 
        dnf_opt_subform_template = "qNIkz4cr9hY8zJ01/MDSEGF719zd85B9", 
        dnf_opt_finalize = "0", 
        dnf_opt_mode = "update", 
        dnf_opt_target = "", dnf_opt_validate = "1", 
        `dnf_class_values[procurement_notice][dnf_class_name]` = "procurement_notice", 
        `dnf_class_values[procurement_notice][notice_id]` = "63ae1a97e9a5a9618fd541d900762e32", 
        `dnf_class_values[procurement_notice][posted]` = "", 
        `autocomplete_input_dnf_class_values[procurement_notice][agency]` = "", 
        `dnf_class_values[procurement_notice][agency]` = "", 
        `dnf_class_values[procurement_notice][zipstate]` = "", 
        `dnf_class_values[procurement_notice][procurement_type][]` = "", 
        `dnf_class_values[procurement_notice][set_aside][]` = "", 
        mode = "list"), encode = "form")

curlconverter adds the httr:: prefixes to the various functions since you can actually use req() to make the request; it's a real R function.
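
As a small usage sketch (not part of the original answer), calling the generated function replays the captured request:

# req() was built by curlconverter above; calling it re-issues the POST.
res <- req()
httr::status_code(res)  # 200 if the captured session cookies are still valid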

However, most of the data being passed in is browser "cruft" and can be whittled down a bit and moved into the POST request:

library(httr)
library(rvest)

POST(url = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list", 
     add_headers(Origin = "https://www.fbo.gov", 
                 Referer = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list"), 
     set_cookies(PHPSESSID = "32efd3be67d43758adcc891c6f6814c4", 
                 sympcsm_cookies_enabled = "1", 
                 BALANCEID = "balancer.172.16.121.7"), 
     body = list(`dnf_class_values[procurement_notice][keywords]` = "machine+learning", 
                 `dnf_class_values[procurement_notice][_posted_date]` = "365", 
                 search_filters = "search", 
                 `_____dummy` = "dnf_", 
                 so_form_prefix = "dnf_", 
                 dnf_opt_action = "search", 
                 dnf_opt_template = "VVY2VDwtojnPpnGoobtUdzXxVYcDLoQW1MDkvvEnorFrm5k54q2OU09aaqzsSe6m", 
                 dnf_opt_template_dir = "Pje8OihulaLVPaQ+C+xSxrG6WrxuiBuGRpBBjyvqt1KAkN/anUTlMWIUZ8ga9kY+", 
                 dnf_opt_subform_template = "qNIkz4cr9hY8zJ01/MDSEGF719zd85B9", 
                 dnf_opt_finalize = "0", 
                 dnf_opt_mode = "update", 
                 dnf_opt_target = "", dnf_opt_validate = "1", 
                 `dnf_class_values[procurement_notice][dnf_class_name]` = "procurement_notice", 
                 `dnf_class_values[procurement_notice][notice_id]` = "63ae1a97e9a5a9618fd541d900762e32", 
                 `dnf_class_values[procurement_notice][posted]` = "", 
                 `autocomplete_input_dnf_class_values[procurement_notice][agency]` = "", 
                 `dnf_class_values[procurement_notice][agency]` = "", 
                 `dnf_class_values[procurement_notice][zipstate]` = "", 
                 `dnf_class_values[procurement_notice][procurement_type][]` = "", 
                 `dnf_class_values[procurement_notice][set_aside][]` = "",
                 mode="list"), 
     encode = "form") -> res

This part:

     set_cookies(PHPSESSID = "32efd3be67d43758adcc891c6f6814c4", 
                 sympcsm_cookies_enabled = "1", 
                 BALANCEID = "balancer.172.16.121.7")

makes me think you should use html_session or GET on the main URL at least once to establish those cookies in the cached curl handler (they will be created and maintained for you automatically).
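
A minimal sketch of that warm-up step (this relies on httr reusing a cached per-host handle, which it does by default, so the cookies come along for free):

library(httr)

# Hit the landing page once; curl's cookie jar gets populated, and later
# POST() calls to the same host reuse the cached handle and send the
# cookies back automatically.
invisible(GET("https://www.fbo.gov/index?s=opportunity&mode=list&tab=list"))

After that, the hard-coded set_cookies() values should no longer be needed.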

The add_headers() bit may not be necessary either, but that's left as an exercise for the reader.

You can find the table you're looking for via:

content(res, as="text", encoding="UTF-8") %>% 
  read_html() %>% 
  html_nodes("table.list") %>% 
  html_table() %>% 
  dplyr::glimpse()
## Observations: 20
## Variables: 4
## $ Opportunity            <chr> "NSN: 1650-01-074-1054; FILTER ELEMENT, FLUID; WSIC: L SP...
## $ Agency/Office/Location <chr> "Defense Logistics Agency DLA Acquisition LocationsDLA Av...
## $ Type /  Set-aside      <chr> "Presolicitation", "Presolicitation", "Award", "Award", "...
## $ Posted On              <chr> "Sep 28, 2016", "Sep 28, 2016", "Sep 28, 2016", "Sep 28, ...
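
To get from this listing to the contract summaries the question actually asks for, one plausible follow-up is to collect the per-notice links and fetch each detail page. This is a sketch only: the "table.list a" selector and the link structure are assumptions, not checked against the live page.

pg <- content(res, as = "text", encoding = "UTF-8") %>% read_html()

# Assumed: each row of table.list links out to a notice detail page.
links <- pg %>%
  html_nodes("table.list a") %>%
  html_attr("href") %>%
  xml2::url_absolute("https://www.fbo.gov/")

# Crude "summary" extraction: the full body text of each detail page.
summaries <- vapply(links, function(u) {
  read_html(u) %>% html_node("body") %>% html_text(trim = TRUE)
}, character(1))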

There's an indicator on the page that these are results "1 - 20 of 2008". You'll need to scrape that as well and deal with the paginated results. That, too, is left as an exercise for the reader.
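
A rough sketch of a starting point (the ".lst-cnt" selector is a guess, so inspect the page for the real one, and the paging form field itself would need to be read from the browser's network tab):

# Parse the "1 - 20 of 2008" indicator to size the loop.
total <- content(res, as = "text", encoding = "UTF-8") %>%
  read_html() %>%
  html_node(".lst-cnt") %>%
  html_text() %>%
  stringr::str_extract("of [0-9,]+") %>%
  readr::parse_number()

n_pages <- ceiling(total / 20)

# Each page would then come from re-POSTing the form with whatever paging
# field the site uses, row-binding the html_table() results as you go.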