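I've come across a public data set that I can't figure out how to get into R directly. Normally I use the following R code to pull data from the web: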
temp <- tempfile()
download.file("http://www.webaddress.com",temp)
data <- read.csv(unz(temp, "name_of_file"))
unlink(temp)
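However, this SEC website has me stumped on how to get the data straight into R. One reason is that when you right-click the link, you get the following code instead of a URL: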
javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$cphMain$lnkSECReport", "", false, "", "Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC", false, true))
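Here is the URL: http://www.adviserinfo.sec.gov/IAPD/InvestmentAdviserData.aspx
Is there a way to get this data directly into R? Right now I download the file, open it with 7-Zip, save it to Excel, and then import it into R.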
Updated code
library(httr)
library(xml2)
library(magrittr) # provides %>%, which xml2 and httr do not export
res <- POST(url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC",
httr::add_headers(Origin = "http://www.adviserinfo.sec.gov"),
body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport"),
encode = "form")
writeBin(content(res, as="raw"), "report.gz")
gzf <- gzfile("report.gz")
doc <- read_xml(gzf)
close(gzf)
xml_find_all(doc, ".//Firms/Firm/Info") %>%
xml_attr("LegalNm") %>%
head(10)
Answer 0 (score: 3)
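This is one of those truly horrible, woefully bad SharePoint sites that popped up like mad in virtually every government e-initiative worldwide and made data even more opaque.
That said, I'm quite surprised that this works: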
library(httr)
library(xml2)
library(magrittr) # provides %>%, used further down
res <- POST(url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC",
httr::add_headers(Origin = "http://www.adviserinfo.sec.gov"),
body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport"),
encode = "form")
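I used curlconverter to extract the network call after cancelling the direct download and inspecting that call in Developer Tools (which has to be open before the download starts).
The "raw" httr request function it generated looks like this: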
httr::VERB(verb = "POST", url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC",
httr::add_headers(Origin = "http://www.adviserinfo.sec.gov",
`Accept-Encoding` = "gzip, deflate", `Accept-Language` = "en-US,en;q=0.8",
`Upgrade-Insecure-Requests` = "1", `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36",
Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
`Cache-Control` = "max-age=0", Referer = "http://www.adviserinfo.sec.gov/IAPD/InvestmentAdviserData.aspx",
Connection = "keep-alive", DNT = "1"), httr::set_cookies(ASP.NET_SessionId = "vp5bt2nrl5m3l4tqq4mkbfrz"),
body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport",
`__EVENTARGUMENT` = "", `__VIEWSTATE` = "/wEPDwUIOTg2OTY2NjYPZBYCZg9kFgQCAQ8WAh4EVGV4dAUeSUFQRCAtIEludmVzdG1lbnQgQWR2aXNlciBEYXRhZAIDD2QWAgIFD2QWEAIDDw8WAh4LUG9zdEJhY2tVcmwFUn4vSUFQRC9Db250ZW50L0J1bGtGZWVkL0NvbXBpbGF0aW9uRG93bmxvYWQuYXNweD9GZWVkUEs9MzcyNjQmRmVlZFR5cGU9SUFfRklSTV9TRUMWAh4Hb25jbGljawWvAWdhKCdzZW5kJywgJ3BhZ2V2aWV3JywgeydwYWdlJzogJ34vSUFQRC9Db250ZW50L0J1bGtGZWVkL0NvbXBpbGF0aW9uRG93bmxvYWQuYXNweD9GZWVkUEs9MzcyNjQmRmVlZFR5cGU9SUFfRklSTV9TRUMnLCAndGl0bGUnOiAnSUFQRCAtIFNFQyBJbnZlc3RtZW50IEFkdmlzZXIgUmVwb3J0IChHWklQKSd9KTtkAgcPZBYCZg8PFgIfAAVKUmVwb3J0IGFzIG9mOiA8Yj5TZXB0ZW1iZXIgNiwgMjAxNjwvYj4gPGJyLz5BcHByb3hpbWF0ZSBmaWxlIHNpemU6IDM3IE1CICBkZAINDw8WAh8BBVR+L0lBUEQvQ29udGVudC9CdWxrRmVlZC9Db21waWxhdGlvbkRvd25sb2FkLmFzcHg/RmVlZFBLPTM3MjY1JkZlZWRUeXBlPUlBX0ZJUk1fU1RBVEUWAh8CBbMBZ2EoJ3NlbmQnLCAncGFnZXZpZXcnLCB7J3BhZ2UnOiAnfi9JQVBEL0NvbnRlbnQvQnVsa0ZlZWQvQ29tcGlsYXRpb25Eb3dubG9hZC5hc3B4P0ZlZWRQSz0zNzI2NSZGZWVkVHlwZT1JQV9GSVJNX1NUQVRFJywgJ3RpdGxlJzogJ0lBUEQgLSBTdGF0ZSBJbnZlc3RtZW50IEFkdmlzZXIgUmVwb3J... <truncated>
`__VIEWSTATEGENERATOR` = "C7F140E8", `__PREVIOUSPAGE` = "_n_AIWFFdFo0uFQroVexEbLyjk41mQczgUv0yM_5WfsMAs5Mr4_W9OsfhauW1md49E6AtLMLKvwsM3efjdsFxSQVs8m60rXjM2G3a38s-vs9jeifY7Z97KwNciQDnS3E0",
`__EVENTVALIDATION` = "/wEdAAQgBK7oCoSH1SyM/nnv4+7OQ6BBh5UglL0V4PbvTmfHL5ETgQBTBoVSpnQmZd0nxKz/1ubqHHzGDP0ztOLUKJjXWi90IlgKV4uaEBSHcRvGBiO1/K20oSh88Xa2qq9BBCI="),
encode = "form")
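In my experience, these truly evil SharePoint sites need all sorts of "view state" info, but I took a shot at paring the call down and the reduced version worked (at least within the 2 minutes after I first visited the site).
From there, you're still not out of the woods: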
res$headers$`content-type`
## "application/x-gzip; charset=utf-8"
That holds even if you add:
`Accept-Encoding` = "gzip, deflate"
to the add_headers() call.
So, since memDecompress() is an absolutely useless function here, you need:
writeBin(content(res, as="raw"), "report.gz")
to put the gzip'd content into a file.
Now we can work with it directly:
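To make the write-then-read step easier to try without hitting the site, here is a minimal offline sketch. It assumes the response body is an ordinary gzip stream; the XML string, the temp file names, and the `payload` variable are all made up for illustration and stand in for `content(res, as="raw")`.

```r
# Stand-in for the server's gzip'd body: build a small gzip file locally.
xml_text <- '<Root><Firms><Firm><Info LegalNm="EXAMPLE ADVISER, LLC"/></Firm></Firms></Root>'
src <- tempfile(fileext = ".gz")
con <- gzfile(src, "w")
writeLines(xml_text, con)
close(con)

# Pretend these raw bytes came from content(res, as = "raw").
payload <- readBin(src, what = "raw", n = file.size(src))

# A gzip stream starts with the magic bytes 1f 8b.
is_gzip <- identical(payload[1:2], as.raw(c(0x1f, 0x8b)))

# Same step as in the answer: dump the raw bytes to a .gz file ...
out <- tempfile(fileext = ".gz")
writeBin(payload, out)

# ... and read it back through a decompressing connection.
gzf <- gzfile(out, "rt")
lines <- readLines(gzf)
close(gzf)

unlink(c(src, out))
```

Checking the magic bytes before writing is optional, but it is a cheap way to notice when the site has returned an HTML error page instead of the archive.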
gzf <- gzfile("report.gz")
doc <- read_xml(gzf)
close(gzf)
xml_find_all(doc, ".//Firms/Firm/Info") %>%
xml_attr("LegalNm") %>%
head(10)
## [1] "LAUNCH ANGELS MANAGEMENT COMPANY, LLC" "JACOBSEN CAPITAL MANAGEMENT, LLC"
## [3] "CORESTATES CAPITAL ADVISORS, LLC" "MINNEAPOLIS PORTFOLIO MANAGEMENT GROUP, LLC"
## [5] "SHANNON RIVER FUND MANAGEMENT, LLC" "AAC BENELUX HOLDING BV"
## [7] "WILLINK ASSET MANAGEMENT LLC" "SPIVAK ASSET MANAGEMENT, LLC"
## [9] "ANNALY MANAGEMENT COMPANY LLC" "WOODMONT INVESTMENT COUNSEL, LLC"
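If the goal is a data frame rather than a character vector, the same XPath can feed one. This is only a sketch on a tiny made-up document: the `Firms/Firm/Info` path and the `LegalNm` attribute come from the code above, while the `Root` wrapper, the `State` attribute, and the firm names are invented for illustration, so check the real file for the attributes it actually carries.

```r
library(xml2)

# Tiny stand-in for the real report (structure partly assumed, see above).
doc <- read_xml('<Root><Firms>
  <Firm><Info LegalNm="EXAMPLE ADVISER ONE, LLC" State="NY"/></Firm>
  <Firm><Info LegalNm="EXAMPLE ADVISER TWO, LLC" State="CA"/></Firm>
</Firms></Root>')

# Same node selection as in the answer.
info <- xml_find_all(doc, ".//Firms/Firm/Info")

# One row per firm, one column per attribute of interest.
firms <- data.frame(
  legal_name = xml_attr(info, "LegalNm"),
  state      = xml_attr(info, "State"),
  stringsAsFactors = FALSE
)
```

xml_attr() returns NA for nodes that lack the requested attribute, so the columns stay aligned even when some firms are missing a field.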
I haven't tried this, but I suspect you can take the items marked with -----> in the following link source:
javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(
-----> "ctl00$cphMain$lnkStateReport",
"",
false,
"",
-----> "Content/BulkFeed/CompilationDownload.aspx?FeedPK=37265&FeedType=IA_FIRM_STATE",
false,
true))
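Put the items marked with -----> in the obvious places in the url and body arguments to get the other content. Those parameters come from the link source of the "State Investment Adviser Report" button.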
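Spelled out, that substitution might look like the sketch below. `build_report_request()` is a hypothetical helper, not part of httr, and like the suggestion itself the state request is untested. The FeedPK values are copied from the link sources quoted in this thread and may well change as the site publishes new reports.

```r
# Hypothetical helper: turns the two marked postback arguments into the
# url and form body that POST() needs.
build_report_request <- function(event_target, postback_path,
                                 base = "http://www.adviserinfo.sec.gov/IAPD/") {
  list(
    url  = paste0(base, postback_path),
    body = list(`__EVENTTARGET` = event_target)
  )
}

# The two marked items from the "State Investment Adviser Report" link.
state_req <- build_report_request(
  "ctl00$cphMain$lnkStateReport",
  "Content/BulkFeed/CompilationDownload.aspx?FeedPK=37265&FeedType=IA_FIRM_STATE"
)

# Untested: uncomment to send the request the same way as the SEC report.
# res <- httr::POST(url = state_req$url,
#                   httr::add_headers(Origin = "http://www.adviserinfo.sec.gov"),
#                   body = state_req$body,
#                   encode = "form")
```

Keeping the request construction separate from the POST() call makes it easy to print `state_req` and compare it against what Developer Tools shows before sending anything.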
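If you really don't want to write the content to a file, you can try an unexported function in my alpha package to inflate the gzip'd raw content directly in R: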
devtools::install_git("https://gitlab.com/hrbrmstr/warc.gz")
raw_report <- warc:::gzuncompress(content(res, as="raw"), 50*1024*1024)
doc <- read_xml(raw_report)
...