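Using R

Time: 2017-11-28 20:10:36

Tags: html r web-scraping rvest httr

I am working on a web-scraping project in which various CSV files can be downloaded from this page: https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link

I would like to programmatically select each reporting quarter in the drop-down list, click Submit (note that the page URL does not change from quarter to quarter), and then "Download CSV" for each quarter.

As a disclaimer, I am new to rvest. Here is what I have tried so far:

  1. I first searched this site and found a related post, "Using r to navigate and scrape a webpage with drop down html forms".

  2. It looks like they use the following code to get the form whose inputs must be filled in to refresh the HTML table: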

    pgsession <- html_session(url)
    pgform <-html_form(pgsession)[[3]]
    filled_form <-set_values(pgform,
            "team" = "ALL",
            "week" = "1",
            "pos"  = "ALL",
            "year" = "2015"      
     )
    
     submit_form(session=pgsession,form=filled_form, POST=url)
    
  3. I tried doing the same for the site above, but got the following instead:

     > html_form(html_session("https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link"))
     [[1]]
     <form> '<unnamed>' (GET )
      <input text> '': 
      <select> '' [1/7]
    
     [[2]]
     <form> 'frm_registration' (POST /filer/registration)
      <input hidden> 'permalink': blue-harbour-group-lp
      <input hidden> 'registration_type': register
      <input text> 'user_email': 
    
     [[3]]
     <form> 'frm-report-error' (POST /filer/report_error)
      <input hidden> 'permalink': blue-harbour-group-lp
      <input text> 'user_name': 
      <input text> 'user_email': 
      <textarea> 'comments' [0 char]
      <textarea> 'g-recaptcha-response' [0 char]
    
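  4. I don't see quite the same setup. The only form with a drop-down is [[1]], with its [1/7] select, but I don't know what that refers to.

  5. Comparing the source code of the two sites, it seems there is a "form-control" class that I should be extracting? How do I do that? Site source code

  6. Finally, after refreshing the table with a different quarter, how do I download the CSV? Is it possible to read the CSV directly from the site without downloading the file?

  7. Thanks!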

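1 Answer:

Answer 0: (score: 2)

+1 for using Developer Tools. That tool/skill will serve you well.

You should seriously consider using the API. But you can use httr together with rvest for this (and I verified that doing so is not against the site's rules):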

library(rvest)
library(httr)
library(tidyverse)

We'll fetch the page first, since we need to scrape the pop-up menu data:

pg <- read_html("https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link")

qtr_nodes <- html_nodes(pg, "select[id='quarter_one'] option")

tibble(  # tibble() replaces data_frame(), which is deprecated in current tibble releases
  qtr = html_text(qtr_nodes),
  value = html_attr(qtr_nodes, "value")
) %>% 
  filter(!grepl("ubscri", qtr)) -> qtrs

qtrs
## # A tibble: 10 x 2
##                           qtr value
##                         <chr> <chr>
##  1 Current Combined 13F/13D/G    -1
##  2       Q3 2017 13F Filings     67
##  3       Q2 2017 13F Filings     66
##  4       Q1 2017 13F Filings     65
##  5       Q4 2016 13F Filings     64
##  6       Q3 2016 13F Filings     63
##  7       Q2 2016 13F Filings     62
##  8       Q1 2016 13F Filings     61
##  9       Q4 2015 13F Filings     60
## 10       Q3 2015 13F Filings     59

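^^ is a translation table from the pretty name to the pop-up value. The value is what the XHR request that happens behind the scenes on submit needs.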
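As a minimal sketch of how that translation table might be used, the helper below looks up the form value for a quarter label. The name `qtr_value` and the small `qtrs_example` table are hypothetical stand-ins (the first rows of the `qtrs` output above, rebuilt by hand so the snippet runs without a network request).

```r
# Hypothetical stand-in for the scraped `qtrs` table (first rows only).
qtrs_example <- data.frame(
  qtr   = c("Current Combined 13F/13D/G", "Q3 2017 13F Filings",
            "Q2 2017 13F Filings", "Q1 2017 13F Filings"),
  value = c("-1", "67", "66", "65"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the pop-up value whose label starts with `label`.
# Stops if the label matches zero rows or more than one row.
qtr_value <- function(tbl, label) {
  hit <- tbl$value[startsWith(trimws(tbl$qtr), label)]
  if (length(hit) != 1) {
    stop("Expected exactly one match for '", label, "', found ", length(hit))
  }
  hit
}

q1_2017 <- qtr_value(qtrs_example, "Q1 2017")
print(q1_2017)
```

With the real `qtrs` table in place of `qtrs_example`, the returned string can be passed on as the quarter value for the request.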

Let's create a function that mimics the XHR request:

get_qtr <- function(qtr) {

  GET(
    url = "https://whalewisdom.com/filer/holdings", 
    httr::add_headers(
      Host = "whalewisdom.com", 
      `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:58.0) Gecko/20100101 Firefox/58.0", 
      Accept = "application/json, text/javascript, */*; q=0.01", 
      `Accept-Language` = "en-US,en;q=0.5", 
      Referer = "https://whalewisdom.com/filer/blue-harbour-group-lp", 
      `X-Requested-With` = "XMLHttpRequest", 
      Connection = "keep-alive"
    ),
    query = list(
      q1 = qtr, 
      id = "384", type_filter = "1,2,3,4", symbol = "", 
      change_filter = "1,2,3,4,5", minimum_ranking = "", minimum_shares = "", 
      is_etf = "0", sc = "true", `_search` = "false", rows = "25", 
      page = "1", sidx = "current_ranking", sord = "asc"
    )
  ) -> res

  stop_for_status(res)

  res <- content(res)

  map_df(res$rows, ~map(.x, ~ifelse(is.null(.x), NA, .x)))

}

We only pass the value in through the qtr parameter, but you could add parameters for the other bits as well.
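One way to parameterize those other bits is to build the query list in a separate function. This is only a sketch: `build_holdings_query` and its argument names are hypothetical, the defaults simply repeat the values from the request above, and it is an assumption (not verified here) that `id = "384"` identifies this particular filer and that the server honors other `rows` or `page` values.

```r
# Hypothetical helper: build the query list for the holdings request.
# Defaults mirror the values used in get_qtr() above.
build_holdings_query <- function(qtr, filer_id = "384", rows = 25, page = 1,
                                 sort_by = "current_ranking", sort_order = "asc") {
  list(
    q1 = as.character(qtr),
    id = as.character(filer_id), type_filter = "1,2,3,4", symbol = "",
    change_filter = "1,2,3,4,5", minimum_ranking = "", minimum_shares = "",
    is_etf = "0", sc = "true", `_search` = "false",
    rows = as.character(rows), page = as.character(page),
    sidx = sort_by, sord = sort_order
  )
}

example_query <- build_holdings_query(65, rows = 100)
str(example_query)
```

Inside `get_qtr()`, the literal `query = list(...)` could then become `query = build_holdings_query(qtr, ...)`.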

Now, use the translation table above to fetch a randomly chosen dataset:

qtr_65 <- get_qtr(65)

glimpse(qtr_65)
## Observations: 19
## Variables: 24
## $ id                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ name                          <chr> "Investors Bancorp Inc", "Xilinx, Inc", "BWX TECHNOLOGIES INC", "AGCO Corp", "A...
## $ symbol                        <chr> "ISBC", "XLNX", "BWXT", "AGCO", "AVT", "WBMD", "AKAM", "RDC", "FFIV", "ADNT", "...
## $ permalink                     <chr> "isbc", "xlnx", "bwxt", "agco", "avt", "wbmd", "akam", "rdc", "ffiv", "adnt", "...
## $ security_type                 <chr> "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "...
## $ stock_id                      <int> 5284, 930, 7803, 600, 375, 838, 3527, 3658, 26, 198034, 5045, 72934, 812, 4116,...
## $ source_date                   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""
## $ source_type                   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""
## $ sector                        <chr> "FINANCE", "INFORMATION TECHNOLOGY", "INDUSTRIALS", "INDUSTRIALS", "INFORMATION...
## $ industry                      <chr> "TRUSTS & THRIFTS", "SEMICONDUCTORS", "ELECTRICAL EQUIPMENT", "MACHINERY", "ELE...
## $ current_shares                <int> 29582428, 6058693, 5287927, 3813700, 4363874, 3361336, 2855493, 10542812, 93835...
## $ previous_shares               <int> 29582428, 7514437, 10561086, 6835700, 5415074, 1795914, 2474193, 10542812, 8599...
## $ shares_change                 <int> 0, -1455744, -5273159, -3022000, -1051200, 1565422, 381300, 0, 78373, 1675570, ...
## $ position_change_type          <chr> NA, "reduction", "reduction", "reduction", "reduction", "addition", "addition",...
## $ percent_shares_change         <chr> "0.0", "-19.3726", "-49.9301", "-44.2091", "-19.4125", "87.1658", "15.4111", "0...
## $ current_ranking               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 999999, 999999, 999999
## $ previous_ranking              <int> 3, 1, 2, 4, 5, 11, 7, 6, 8, 999999, 10, 9, 999999, 12, 14, 13, 15, 16, 17
## $ current_percent_of_portfolio  <dbl> 15.3264, 12.6366, 9.0686, 8.2689, 7.1946, 6.3798, 6.1419, 5.9180, 4.8200, 4.387...
## $ previous_percent_of_portfolio <dbl> 13.7987, 15.1687, 14.0194, 13.2249, 8.6205, 2.9767, 5.5165, 6.6592, 4.1615, NA,...
## $ current_mv                    <chr> "425395000.0", "350738000.0", "251705000.0", "229508000.0", "199691000.0", "177...
## $ previous_mv                   <chr> "412675000.0", "453647000.0", "419275000.0", "395514000.0", "257812000.0", "890...
## $ percent_ownership             <chr> "26.3298285", "2.4338473", "5.3287767", "4.5501506", "3.3856140", "8.6721677", ...
## $ quarter_first_owned           <chr> "Q1 2014", "Q1 2015", "Q4 2013", "Q2 2014", "Q2 2015", "Q4 2016", "Q3 2016", "Q...
## $ quarter_id_owned              <int> 53, 57, 52, 54, 58, 64, 63, 53, 61, 65, 47, 63, 65, 64, 64, 64, 64, 60, 64

I have no idea whether ^^ is what's in the CSV, since I didn't register for an account, but you can verify that and hopefully adapt it.
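To cover every quarter and end up with a CSV file, something along these lines could work. It is a sketch under assumptions: `fetch_quarters` and `to_numeric_cols` are hypothetical helpers, `fake_fetch` is a stub standing in for `get_qtr()` so the example runs offline, and it assumes every quarter comes back with the same columns.

```r
# Hypothetical helper: call `fetch` (e.g. get_qtr) once per quarter value and
# stack the results, pausing between calls to go easy on the server.
fetch_quarters <- function(values, fetch, pause = 1) {
  pieces <- lapply(values, function(v) {
    Sys.sleep(pause)
    df <- as.data.frame(fetch(v), stringsAsFactors = FALSE)
    df$quarter_value <- rep(v, nrow(df))  # record which quarter each row came from
    df
  })
  do.call(rbind, pieces)  # assumes identical columns in every piece
}

# Hypothetical helper: columns such as current_mv arrive as character strings;
# convert the ones that are present to numeric.
to_numeric_cols <- function(df, cols) {
  for (col in intersect(cols, names(df))) {
    df[[col]] <- as.numeric(df[[col]])
  }
  df
}

# Stub used here instead of a real request.
fake_fetch <- function(v) {
  data.frame(symbol = c("ISBC", "XLNX"),
             current_mv = c("425395000.0", "350738000.0"),
             stringsAsFactors = FALSE)
}

all_qtrs <- fetch_quarters(c("65", "66"), fake_fetch, pause = 0)
all_qtrs <- to_numeric_cols(all_qtrs, c("current_mv", "previous_mv"))

# Write the combined table to a CSV file (a temporary file in this sketch).
out_file <- tempfile(fileext = ".csv")
write.csv(all_qtrs, out_file, row.names = FALSE)
```

With the real function, the call would look like `fetch_quarters(qtrs$value, get_qtr)`, keeping the default pause between requests.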