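Importing SharePoint website data into R

Time: 2016-09-06 19:25:09

Tags: r

I came across a public data set that I can't figure out how to pull straight into R. Normally I use the following R code to grab data from the web: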

temp <- tempfile()
download.file("http://www.webaddress.com",temp)
data <- read.csv(unz(temp, "name_of_file"))
unlink(temp)

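However, this SEC website has me stumped on how to pull the data straight into R. One reason is that when you right-click the link, instead of a URL you get the following code: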

javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$cphMain$lnkSECReport", "", false, "", "Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC", false, true))

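Here is the URL: http://www.adviserinfo.sec.gov/IAPD/InvestmentAdviserData.aspx

Is there a way to pull this data directly into R? For now, I download the file, open it with 7-Zip, save it to Excel, and then import it into R.

Updated code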

library(httr)
library(xml2)
library(magrittr)  # provides the %>% pipe used below

res <- POST(url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC", 
            httr::add_headers(Origin = "http://www.adviserinfo.sec.gov"), 
            body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport"), 
            encode = "form")

writeBin(content(res, as="raw"), "report.gz")
gzf <- gzfile("report.gz")
doc <- read_xml(gzf)
close(gzf)


xml_find_all(doc, ".//Firms/Firm/Info") %>% 
  xml_attr("LegalNm") %>% 
  head(10)

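1 Answer:

Answer 0: (score: 3)

This is one of those truly horrible, woeful SharePoint sites that are popping up in nearly every government e-initiative around the globe and that make data more opaque rather than more transparent.

That having been said, I'm pretty surprised that this works: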

library(httr)
library(xml2)

res <- POST(url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC", 
           httr::add_headers(Origin = "http://www.adviserinfo.sec.gov"), 
           body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport"), 
           encode = "form")

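I used curlconverter to extract the network request: I cancelled the direct download and then looked at that request in Developer Tools (which has to be open before the download starts).

The "raw" httr request function it generated looked like this: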

httr::VERB(verb = "POST", url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC", 
           httr::add_headers(Origin = "http://www.adviserinfo.sec.gov", 
                             `Accept-Encoding` = "gzip, deflate", `Accept-Language` = "en-US,en;q=0.8", 
                             `Upgrade-Insecure-Requests` = "1", `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36", 
                             Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
                             `Cache-Control` = "max-age=0", Referer = "http://www.adviserinfo.sec.gov/IAPD/InvestmentAdviserData.aspx", 
                             Connection = "keep-alive", DNT = "1"), httr::set_cookies(ASP.NET_SessionId = "vp5bt2nrl5m3l4tqq4mkbfrz"), 
           body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport", 
                       `__EVENTARGUMENT` = "", `__VIEWSTATE` = "/wEPDwUIOTg2OTY2NjYPZBYCZg9kFgQCAQ8WAh4EVGV4dAUeSUFQRCAtIEludmVzdG1lbnQgQWR2aXNlciBEYXRhZAIDD2QWAgIFD2QWEAIDDw8WAh4LUG9zdEJhY2tVcmwFUn4vSUFQRC9Db250ZW50L0J1bGtGZWVkL0NvbXBpbGF0aW9uRG93bmxvYWQuYXNweD9GZWVkUEs9MzcyNjQmRmVlZFR5cGU9SUFfRklSTV9TRUMWAh4Hb25jbGljawWvAWdhKCdzZW5kJywgJ3BhZ2V2aWV3JywgeydwYWdlJzogJ34vSUFQRC9Db250ZW50L0J1bGtGZWVkL0NvbXBpbGF0aW9uRG93bmxvYWQuYXNweD9GZWVkUEs9MzcyNjQmRmVlZFR5cGU9SUFfRklSTV9TRUMnLCAndGl0bGUnOiAnSUFQRCAtIFNFQyBJbnZlc3RtZW50IEFkdmlzZXIgUmVwb3J0IChHWklQKSd9KTtkAgcPZBYCZg8PFgIfAAVKUmVwb3J0IGFzIG9mOiA8Yj5TZXB0ZW1iZXIgNiwgMjAxNjwvYj4gPGJyLz5BcHByb3hpbWF0ZSBmaWxlIHNpemU6IDM3IE1CICBkZAINDw8WAh8BBVR+L0lBUEQvQ29udGVudC9CdWxrRmVlZC9Db21waWxhdGlvbkRvd25sb2FkLmFzcHg/RmVlZFBLPTM3MjY1JkZlZWRUeXBlPUlBX0ZJUk1fU1RBVEUWAh8CBbMBZ2EoJ3NlbmQnLCAncGFnZXZpZXcnLCB7J3BhZ2UnOiAnfi9JQVBEL0NvbnRlbnQvQnVsa0ZlZWQvQ29tcGlsYXRpb25Eb3dubG9hZC5hc3B4P0ZlZWRQSz0zNzI2NSZGZWVkVHlwZT1JQV9GSVJNX1NUQVRFJywgJ3RpdGxlJzogJ0lBUEQgLSBTdGF0ZSBJbnZlc3RtZW50IEFkdmlzZXIgUmVwb3J... <truncated>
                       `__VIEWSTATEGENERATOR` = "C7F140E8", `__PREVIOUSPAGE` = "_n_AIWFFdFo0uFQroVexEbLyjk41mQczgUv0yM_5WfsMAs5Mr4_W9OsfhauW1md49E6AtLMLKvwsM3efjdsFxSQVs8m60rXjM2G3a38s-vs9jeifY7Z97KwNciQDnS3E0", 
                       `__EVENTVALIDATION` = "/wEdAAQgBK7oCoSH1SyM/nnv4+7OQ6BBh5UglL0V4PbvTmfHL5ETgQBTBoVSpnQmZd0nxKz/1ubqHHzGDP0ztOLUKJjXWi90IlgKV4uaEBSHcRvGBiO1/K20oSh88Xa2qq9BBCI="), 
           encode = "form")

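In my experience, these truly evil SharePoint sites need all sorts of "view state" information, but I took a shot at paring the call down and the reduced version worked (at least within 2 minutes of my first visit to the site).

You're not out of the woods after that, though: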
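If the pared-down call ever stops working, one possible fallback is to read the hidden form fields (`__VIEWSTATE`, `__EVENTVALIDATION` and so on) from the landing page and send them back in the POST body. The sketch below is an assumption, not something the answer tested; `hidden_fields()` is a hypothetical helper name, and it only relies on `read_html()`, `xml_find_all()` and `xml_attr()` from xml2.

```r
library(xml2)

# Hypothetical helper: collect every <input type="hidden"> on a page
# into a named list that can be merged into a POST body.
hidden_fields <- function(html) {
  page   <- read_html(html)
  inputs <- xml_find_all(page, ".//input[@type='hidden']")
  values <- xml_attr(inputs, "value")
  values[is.na(values)] <- ""          # inputs without a value attribute
  as.list(setNames(values, xml_attr(inputs, "name")))
}

# Made-up page fragment, used only to show the shape of the result
sample_page <- paste0(
  '<html><body><form>',
  '<input type="hidden" name="__VIEWSTATE" value="abc123"/>',
  '<input type="hidden" name="__VIEWSTATEGENERATOR" value="C7F140E8"/>',
  '<input type="text" name="visible" value="ignored"/>',
  '</form></body></html>'
)
fields <- hidden_fields(sample_page)
str(fields)
```

In a real request the list would be combined with the event target, for example `modifyList(fields, list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport"))`, and passed as `body` with `encode = "form"`.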

res$headers$`content-type`
## "application/x-gzip; charset=utf-8"

even if you add:

`Accept-Encoding` = "gzip, deflate"

to the add_headers() call.

So, since memDecompress() is a completely useless function for this, you need to:

writeBin(content(res, as="raw"), "report.gz")

to get the gzip'd content into a file.

Now, we can work with it directly:
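Before parsing, it can be worth confirming that the bytes really are gzip data, since a failed postback may return an HTML page instead. This is a minimal sketch under that assumption; `looks_gzipped()` is a hypothetical helper that only checks the two-byte gzip magic number (`1f 8b`).

```r
# Hypothetical helper: TRUE when a raw vector starts with the gzip magic bytes
looks_gzipped <- function(x) {
  is.raw(x) && length(x) >= 2L &&
    x[1] == as.raw(0x1f) && x[2] == as.raw(0x8b)
}

# Build a small gzip file locally so the check can be tried offline
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "wb")
writeBin(charToRaw("<Firms/>"), con)
close(con)

gz_bytes   <- readBin(tmp, what = "raw", n = file.size(tmp))
html_bytes <- charToRaw("<html>error page</html>")

looks_gzipped(gz_bytes)
looks_gzipped(html_bytes)
unlink(tmp)
```

With the response from the answer, the same check would be applied to `content(res, as = "raw")` before calling `writeBin()`.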

gzf <- gzfile("report.gz")

doc <- read_xml(gzf)
close(gzf)

library(magrittr)  # provides the %>% pipe

xml_find_all(doc, ".//Firms/Firm/Info") %>% 
  xml_attr("LegalNm") %>% 
  head(10)
## [1] "LAUNCH ANGELS MANAGEMENT COMPANY, LLC"       "JACOBSEN CAPITAL MANAGEMENT, LLC"           
## [3] "CORESTATES CAPITAL ADVISORS, LLC"            "MINNEAPOLIS PORTFOLIO MANAGEMENT GROUP, LLC"
## [5] "SHANNON RIVER FUND MANAGEMENT, LLC"          "AAC BENELUX HOLDING BV"                     
## [7] "WILLINK ASSET MANAGEMENT LLC"                "SPIVAK ASSET MANAGEMENT, LLC"               
## [9] "ANNALY MANAGEMENT COMPANY LLC"               "WOODMONT INVESTMENT COUNSEL, LLC"           

I haven't tried it, but I suspect you can take the:
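To get more than the legal names, the attributes of each `Info` node can be collected into a data frame. The sketch below assumes nothing about the feed beyond the `Firms/Firm/Info` path and the `LegalNm` attribute shown above; the wrapper element and the `City` attribute in the sample document are made up for illustration, and `info_to_df()` is a hypothetical helper.

```r
library(xml2)

# Hypothetical helper: one row per <Info> node, one column per attribute name
info_to_df <- function(doc) {
  nodes <- xml_find_all(doc, ".//Firms/Firm/Info")
  attrs <- xml_attrs(nodes)                     # list of named character vectors
  cols  <- unique(unlist(lapply(attrs, names)))
  if (length(attrs) == 0L) return(data.frame())
  # Missing attributes become NA because indexing by an absent name yields NA
  mat <- do.call(rbind, lapply(attrs, function(a) unname(a[cols])))
  colnames(mat) <- cols
  as.data.frame(mat, stringsAsFactors = FALSE)
}

# Made-up miniature document with the same Firms/Firm/Info nesting
sample_doc <- read_xml(paste0(
  '<Report><Firms>',
  '<Firm><Info LegalNm="FIRM A, LLC" City="BOSTON"/></Firm>',
  '<Firm><Info LegalNm="FIRM B, LLC"/></Firm>',
  '</Firms></Report>'
))
firms <- info_to_df(sample_doc)
firms
```

On the real report, `info_to_df(doc)` would return whatever attributes the `Info` nodes actually carry, so the resulting column names depend on the feed.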

javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(
  -----> "ctl00$cphMain$lnkStateReport", 
  "", 
  false, 
  "", 
  -----> "Content/BulkFeed/CompilationDownload.aspx?FeedPK=37265&FeedType=IA_FIRM_STATE", 
  false, 
  true))

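items marked with -----> and put them in the obvious places in the url and body arguments to get the other content. Those parameters come from the source of the "State Investment Adviser Report" button link.

If you really don't want to write the content to a file, you can try a non-exported function in an alpha package of mine to inflate the gzip'd raw content directly in R: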
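The substitution can also be scripted. The sketch below pulls the two marked values out of a `javascript:` link and builds the `url` and `body` arguments for `POST()`. It assumes the quoted arguments always appear in the order shown in the link (event target first, action URL fourth) and that the action is relative to `http://www.adviserinfo.sec.gov/IAPD/`, as in the working request; `postback_request()` is a hypothetical helper, and like the original suggestion it has not been tried against the live site.

```r
# Hypothetical helper: turn a WebForm_PostBackOptions link into POST arguments
postback_request <- function(js, base = "http://www.adviserinfo.sec.gov/IAPD/") {
  # Every double-quoted argument, in order of appearance
  quoted <- regmatches(js, gregexpr('"[^"]*"', js))[[1]]
  quoted <- gsub('^"|"$', "", quoted)
  list(
    url  = paste0(base, quoted[4]),               # action URL is the 4th quoted value
    body = list(`__EVENTTARGET` = quoted[1])      # event target is the 1st
  )
}

state_link <- paste0(
  'javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(',
  '"ctl00$cphMain$lnkStateReport", "", false, "", ',
  '"Content/BulkFeed/CompilationDownload.aspx?FeedPK=37265&FeedType=IA_FIRM_STATE", ',
  'false, true))'
)
req <- postback_request(state_link)
str(req)
```

The result would then be passed along as `POST(url = req$url, body = req$body, encode = "form", ...)` with the same `Origin` header as before.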

devtools::install_git("https://gitlab.com/hrbrmstr/warc.git")

raw_report <- warc:::gzuncompress(content(res, as="raw"), 50*1024*1024)
doc <- read_xml(raw_report)
...
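Another option that avoids both the temporary file and a non-exported function is to wrap the raw bytes in a base R `gzcon()` connection. This is a sketch of a swapped-in technique, not the answer's method; `gunzip_raw()` is a hypothetical helper, and it is exercised here on locally generated gzip bytes rather than on the SEC download.

```r
# Hypothetical helper: inflate gzip'd raw bytes in memory using base R only
gunzip_raw <- function(x, chunk = 1024 * 1024) {
  con <- gzcon(rawConnection(x))
  on.exit(close(con))
  pieces <- list()
  repeat {
    buf <- readBin(con, what = "raw", n = chunk)  # read up to `chunk` bytes
    if (length(buf) == 0L) break                  # end of stream
    pieces[[length(pieces) + 1L]] <- buf
  }
  if (length(pieces) == 0L) raw(0) else do.call(c, pieces)
}

# Create gzip bytes locally so the helper can be tried offline
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "wb")
writeBin(charToRaw("<Report><Firms/></Report>"), con)
close(con)
gz_bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

inflated <- gunzip_raw(gz_bytes)
rawToChar(inflated)
```

Applied to the response, this would read `doc <- read_xml(gunzip_raw(content(res, as = "raw")))`, since `read_xml()` accepts a raw vector.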