我正在开展一个网页抓取项目,可以从此网页下载各种csv文件:https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link
我希望能够以编程方式在下拉列表中选择各种报告的季度,点击提交(请注意页面的URL不会针对每个不同的季度更改),然后为每个季度“下载CSV”
作为免责声明,我是rvest的新手,以下是我尝试解决方案:
我首先查看了此网站,发现了相关帖子Using r to navigate and scrape a webpage with drop down html forms
看起来他们使用以下代码来获取刷新HTML表所需输入的表单:
pgsession <- html_session(url)
pgform <-html_form(pgsession)[[3]]
filled_form <-set_values(pgform,
"team" = "ALL",
"week" = "1",
"pos" = "ALL",
"year" = "2015"
)
submit_form(session=pgsession,form=filled_form, POST=url)
我尝试为上面的网站执行此操作,而是获得以下内容
> html_form(html_session("https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link"))
[[1]]
<form> '<unnamed>' (GET )
<input text> '':
<select> '' [1/7]
[[2]]
<form> 'frm_registration' (POST /filer/registration)
<input hidden> 'permalink': blue-harbour-group-lp
<input hidden> 'registration_type': register
<input text> 'user_email':
[[3]]
<form> 'frm-report-error' (POST /filer/report_error)
<input hidden> 'permalink': blue-harbour-group-lp
<input text> 'user_name':
<input text> 'user_email':
<textarea> 'comments' [0 char]
<textarea> 'g-recaptcha-response' [0 char]
我没有完全看到相同的设置,下拉选项的唯一形式是[1]和[1/7]选项,但我不知道这是指什么
最后,在使用其他季度刷新表后,如何下载CSV?是否可以直接从网站上阅读csv而无需下载文件?
谢谢!
答案 0 :(得分:2)
+1使用开发人员工具。该工具/技能将为您提供良好的服务。
您应该认真考虑使用API。但您可以将httr
和rvest
一起用于此(并且我验证了它不符合网站规则):
library(rvest)
library(httr)
library(tidyverse)
我们将首先获取页面,因为我们需要抓取弹出菜单数据:
pg <- read_html("https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link")
qtr_nodes <- html_nodes(pg, "select[id='quarter_one'] option")
data_frame(
qtr = html_text(qtr_nodes),
value = html_attr(qtr_nodes, "value")
) %>%
filter(!grepl("ubscri", qtr)) -> qtrs
qtrs
## # A tibble: 10 x 2
## qtr value
## <chr> <chr>
## 1 Current Combined 13F/13D/G -1
## 2 Q3 2017 13F Filings 67
## 3 Q2 2017 13F Filings 66
## 4 Q1 2017 13F Filings 65
## 5 Q4 2016 13F Filings 64
## 6 Q3 2016 13F Filings 63
## 7 Q2 2016 13F Filings 62
## 8 Q1 2016 13F Filings 61
## 9 Q4 2015 13F Filings 60
## 10 Q3 2015 13F Filings 59
^^是从漂亮名称到弹出值的转换表。该值是提交幕后发生的XHR请求所必需的。
让我们创建一个模拟XHR请求的函数:
get_qtr <- function(qtr) {
GET(
url = "https://whalewisdom.com/filer/holdings",
httr::add_headers(
Host = "whalewisdom.com",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:58.0) Gecko/20100101 Firefox/58.0",
Accept = "application/json, text/javascript, */*; q=0.01",
`Accept-Language` = "en-US,en;q=0.5",
Referer = "https://whalewisdom.com/filer/blue-harbour-group-lp",
`X-Requested-With` = "XMLHttpRequest",
Connection = "keep-alive"
),
query = list(
q1 = qtr,
id = "384", type_filter = "1,2,3,4", symbol = "",
change_filter = "1,2,3,4,5", minimum_ranking = "", minimum_shares = "",
is_etf = "0", sc = "true", `_search` = "false", rows = "25",
page = "1", sidx = "current_ranking", sord = "asc"
)
) -> res
stop_for_status(res)
res <- content(res)
map_df(res$rows, ~map(.x, ~ifelse(is.null(.x), NA, .x)))
}
我们只是将value
传递给qtr
参数,但您也可以为其他位添加参数。
现在,使用上面的转换表来获取随机选择的数据集:
qtr_65 <- get_qtr(65)
glimpse(qtr_65)
## Observations: 19
## Variables: 24
## $ id <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ name <chr> "Investors Bancorp Inc", "Xilinx, Inc", "BWX TECHNOLOGIES INC", "AGCO Corp", "A...
## $ symbol <chr> "ISBC", "XLNX", "BWXT", "AGCO", "AVT", "WBMD", "AKAM", "RDC", "FFIV", "ADNT", "...
## $ permalink <chr> "isbc", "xlnx", "bwxt", "agco", "avt", "wbmd", "akam", "rdc", "ffiv", "adnt", "...
## $ security_type <chr> "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "...
## $ stock_id <int> 5284, 930, 7803, 600, 375, 838, 3527, 3658, 26, 198034, 5045, 72934, 812, 4116,...
## $ source_date <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""
## $ source_type <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""
## $ sector <chr> "FINANCE", "INFORMATION TECHNOLOGY", "INDUSTRIALS", "INDUSTRIALS", "INFORMATION...
## $ industry <chr> "TRUSTS & THRIFTS", "SEMICONDUCTORS", "ELECTRICAL EQUIPMENT", "MACHINERY", "ELE...
## $ current_shares <int> 29582428, 6058693, 5287927, 3813700, 4363874, 3361336, 2855493, 10542812, 93835...
## $ previous_shares <int> 29582428, 7514437, 10561086, 6835700, 5415074, 1795914, 2474193, 10542812, 8599...
## $ shares_change <int> 0, -1455744, -5273159, -3022000, -1051200, 1565422, 381300, 0, 78373, 1675570, ...
## $ position_change_type <chr> NA, "reduction", "reduction", "reduction", "reduction", "addition", "addition",...
## $ percent_shares_change <chr> "0.0", "-19.3726", "-49.9301", "-44.2091", "-19.4125", "87.1658", "15.4111", "0...
## $ current_ranking <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 999999, 999999, 999999
## $ previous_ranking <int> 3, 1, 2, 4, 5, 11, 7, 6, 8, 999999, 10, 9, 999999, 12, 14, 13, 15, 16, 17
## $ current_percent_of_portfolio <dbl> 15.3264, 12.6366, 9.0686, 8.2689, 7.1946, 6.3798, 6.1419, 5.9180, 4.8200, 4.387...
## $ previous_percent_of_portfolio <dbl> 13.7987, 15.1687, 14.0194, 13.2249, 8.6205, 2.9767, 5.5165, 6.6592, 4.1615, NA,...
## $ current_mv <chr> "425395000.0", "350738000.0", "251705000.0", "229508000.0", "199691000.0", "177...
## $ previous_mv <chr> "412675000.0", "453647000.0", "419275000.0", "395514000.0", "257812000.0", "890...
## $ percent_ownership <chr> "26.3298285", "2.4338473", "5.3287767", "4.5501506", "3.3856140", "8.6721677", ...
## $ quarter_first_owned <chr> "Q1 2014", "Q1 2015", "Q4 2013", "Q2 2014", "Q2 2015", "Q4 2016", "Q3 2016", "Q...
## $ quarter_id_owned <int> 53, 57, 52, 54, 58, 64, 63, 53, 61, 65, 47, 63, 65, 64, 64, 64, 64, 60, 64
我不知道^^是否是CSV中的内容,因为我没有注册帐户,但您可以验证并希望修改。