Using the R httr POST command

Time: 2016-12-09 02:15:34

Tags: post httr

A colleague of mine has been working with financial data in Python. He sent me a piece of Python code as an example:

import requests
import bs4

response = requests.post('http://www.nasdaq.com/symbol/voo/historical',
                         data='2y|false|VOO',
                         headers={'Content-Type': 'application/json'})

html = bs4.BeautifulSoup(response.text, 'lxml')
table_data = [[td.text.strip() for td in tr('td')]
              for tr in html('tr')][1:]

print table_data[0]
print table_data[-1]

Here is a sample of the output from the Python code:

$ python scrape-nasdaq.py
[u'16:00', u'206.21', u'207.27', u'205.95', u'206.74', u'2,983,048']
[u'12/08/2014', u'190.35', u'190.83', u'188.85', u'189.5', u'1,527,709']

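This is supposed to download two years' worth of stock data for a Vanguard fund (chosen more or less at random, I guess), and it appears to do exactly that. (Note the "2014" string in the output.)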

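For some time I have had a passive interest in web scraping, but never had the opportunity to take it seriously. I took my colleague's code as a challenge and decided to try to emulate the Python code using the httr package, which, IIUC, was partly inspired by the BeautifulSoup Python package.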

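I tried to emulate the Python code as closely as possible, but I have not been able to get my code to download the data. More precisely, sometimes I can download some data, namely the site's default 3-month listing, but I cannot get the site to honor my request for two years of data. At other times, I just get an error response:

> stop_for_status(result)
Error: Service Unavailable (HTTP 503).

On "good" runs of the code I get a "200" response (but, as I said, I do not get all of the requested data). I have no idea why it works (at all) on some runs and fails on others.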

Here is the R code and the results (for a case where I got some data):

library(httr)
library(dplyr)
library(rvest)

base_url <- "http://www.nasdaq.com/symbol/voo/historical"

body     <- list(data="2y|false|VOO")
headers  <- '"Content-Type"="application/json"'

result <- POST(base_url,
               add_headers(headers),
               body = body,
               encode="json",
               verbose())
stop_for_status(result)

table_data <- content(result) %>%
    html_nodes("table")  %>%
    html_table(header=TRUE) %>%
    `[[`(1) %>%
    slice(2:n())

names(table_data) <- sapply(names(table_data), function(name) {
                            unlist(strsplit(name, '\r'))[1]})
names(table_data)
head(table_data)
tail(table_data)

> head(table_data)
        Date   Open    High     Low Close / Last    Volume
1 12/07/2016 203.45 206.320 203.300       206.20 2,253,230
2 12/06/2016 203.17 203.630 202.640       203.62 2,412,897
3 12/05/2016 202.64 203.300 202.415       202.91 2,070,675
4 12/02/2016 201.70 202.230 201.350       201.75 2,119,016
5 12/01/2016 202.68 202.710 201.240       201.61 3,281,407
6 11/30/2016 203.53 203.692 202.310       202.40 2,359,018

> tail(table_data)
         Date   Open     High    Low Close / Last    Volume
60 09/14/2016 194.83 196.1199 194.12       194.66 2,966,319
61 09/13/2016 196.25 196.5100 194.12       194.78 3,361,848
62 09/12/2016 194.88 198.9500 194.80       198.48 2,800,160
63 09/09/2016 199.08 199.1800 195.67       195.68 3,430,638
64 09/08/2016 200.62 200.9199 200.19       200.52 2,180,474
65 09/07/2016 200.82 201.1500 200.32       200.99 1,455,442
> 

As you can see from the head/tail above, all of the results are confined to the current day and the preceding three months.
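A quick way to confirm how much history a response actually covers is to subtract the oldest date in the table from the newest. The sketch below does this in Python (the language of the colleague's script) for the endpoint dates shown in the R output above and the oldest date in the Python output; the helper name `span_in_days` is hypothetical and not part of either script.

```python
from datetime import datetime


def span_in_days(newest: str, oldest: str, fmt: str = "%m/%d/%Y") -> int:
    """Return the number of days between two date strings in the table's format."""
    return (datetime.strptime(newest, fmt) - datetime.strptime(oldest, fmt)).days


# Endpoints of the table returned to the R code (head and tail above).
r_span = span_in_days("12/07/2016", "09/07/2016")

# Same newest date, paired with the oldest date in the Python output.
py_span = span_in_days("12/07/2016", "12/08/2014")

print("R result covers", r_span, "days")
print("Python result covers", py_span, "days")
```

A span of roughly 90 days points to the site's default 3-month listing rather than the requested two years.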

Suggestions are welcome. I have attached the verbose output from the POST command, as well as my session info.

- Michael

Appendix 1: Verbose output from the POST command

-> POST /symbol/voo/historical HTTP/1.1
-> User-Agent: libcurl/7.35.0 r-curl/2.3 httr/1.2.1
-> Host: www.nasdaq.com
-> Accept-Encoding: gzip, deflate
-> Accept: application/json, text/xml, application/xml, */*
-> Content-Type: application/json
-> Content-Length: 23
-> 
>> {"data":"2y|false|VOO"}

<- HTTP/1.1 503 Service Unavailable
<- Content-Type: text/html; charset=us-ascii
<- Server: Microsoft-HTTPAPI/2.0
<- Content-Length: 326
<- Expires: Fri, 09 Dec 2016 01:58:03 GMT
<- Cache-Control: max-age=0, no-cache, no-store
<- Pragma: no-cache
<- Date: Fri, 09 Dec 2016 01:58:03 GMT
<- Connection: close
<- Set-Cookie:
NSC_W.TJUFEFGFOEFS.OBTEBR.80=ffffffffc3a08e3045525d5f4f58455e445a4a423660;expires=Fri,
09-Dec-2016 02:08:03 GMT;path=/;httponly
<- 
> stop_for_status(result)
Error: Service Unavailable (HTTP 503).
> 

Appendix 2: Session info

> session_info()
Session info
-----------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.2 (2016-10-31)
 system   x86_64, linux-gnu           
 ui       X11                         
 language en_US                       
 collate  en_US.UTF-8                 
 tz       <NA>                        
 date     2016-12-08                  

Packages
-------------------------------------------------------------
 package    * version    date       source                          
 assertthat   0.1        2013-12-06 CRAN (R 3.2.1)                  
 colorspace   1.3-1      2016-11-18 CRAN (R 3.3.2)                  
 curl         2.3        2016-11-24 CRAN (R 3.3.2)                  
 DBI          0.5-1      2016-09-10 CRAN (R 3.3.1)                  
 devtools   * 1.12.0     2016-06-24 CRAN (R 3.3.1)                  
 digest       0.6.10     2016-08-02 CRAN (R 3.3.1)                  
 dplyr      * 0.5.0      2016-06-24 CRAN (R 3.3.1)                  
 ggplot2    * 2.2.0      2016-11-11 CRAN (R 3.3.2)                  
 gtable       0.2.0      2016-02-26 CRAN (R 3.2.3)                  
 hms          0.3        2016-11-22 CRAN (R 3.3.2)                  
 httr       * 1.2.1      2016-07-03 CRAN (R 3.3.1)                  
 jsonlite     1.1        2016-09-14 CRAN (R 3.3.1)                  
 lazyeval     0.2.0      2016-06-12 CRAN (R 3.3.0)                  
 magrittr     1.5        2014-11-22 CRAN (R 3.2.0)                  
 memoise      1.0.0      2016-01-29 CRAN (R 3.2.3)                  
 munsell      0.4.3      2016-02-13 CRAN (R 3.2.3)                  
 plyr         1.8.4      2016-06-08 CRAN (R 3.3.0)                  
 purrr      * 0.2.2      2016-06-18 CRAN (R 3.3.0)                  
 R6           2.2.0      2016-10-05 CRAN (R 3.3.1)                  
 Rcpp         0.12.8     2016-11-17 CRAN (R 3.3.2)                  
 readr      * 1.0.0.9000 2016-11-01 Github (tidyverse/readr@b8c3ddb)
 rvest      * 0.3.2      2016-06-17 CRAN (R 3.3.0)                  
 scales       0.4.1      2016-11-09 CRAN (R 3.3.2)                  
 tibble     * 1.2        2016-08-26 CRAN (R 3.3.1)                  
 tidyr      * 0.6.0      2016-08-12 CRAN (R 3.3.1)                  
 tidyverse  * 1.0.0      2016-09-09 CRAN (R 3.3.1)                  
 withr        1.0.2      2016-06-20 CRAN (R 3.3.1)                  
 xml2       * 1.0.0      2016-06-24 CRAN (R 3.3.1)                  
> 

Re: Hadley's suggestion:

Thanks, Hadley. I'm not sure how to do the comparison you suggested, but here is my cut at it.

First, from R:

result_text <- content(result, "text")
result_text <- unlist(strsplit(result_text, "\r\n"))
result_text[1:10]

> result_text[1:10]
 [1] "<div id=\"quotes_content_left_pnlAJAX\">"    "\t"                                         
 [3] "            <h3 class=\"table-headtag\">"    "                Results
for: 3 Month, From "
 [5] "09-SEP-2016  TO 09-DEC-2016 "                "            </h3>"                          
 [7] "            <table>"                         "                <thead>"                    
 [9] "                    <tr>"                    "
<th>Date</th>"

Now from Python:

In [21]: response.text[0:255]
Out[21]: u'<div id="quotes_content_left_pnlAJAX">\r\n\t\r\n            <h3
class="table-headtag">\r\n                Results for: 2 Years, From
\r\n09-DEC-2014  TO 09-DEC-2016 \r\n            </h3>\r\n
<table>\r\n                <thead>\r\n                    <tr>\r\n      '

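As far as I can tell, these are identical except for the time interval. My only thought is that I have somehow botched the syntax of the "body" argument, so that my time interval is not being communicated to the server.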

Same results with curl

No doubt this is unsurprising, but I get the same results (only the default three months of data) from the following curl code:

library(curl)
h <- new_handle()
handle_setopt(h, copypostfields = "data='2y|false|VOO'")
handle_setheaders(h,
                  "Content-Type"="application/json"
                  )
result <- curl_fetch_memory(base_url, handle = h)
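To make the suspicion about a botched body concrete, the three request bodies shown so far can be compared offline. This is only a sketch in Python, not a fix: the first string is what `requests` sends when `data` is a plain string, the second reproduces the JSON body from the httr verbose log (note its `Content-Length: 23`), and the third is the string handed to `copypostfields`. The variable names are illustrative.

```python
import json

# Body sent by the Python script: requests passes a str `data` through unchanged.
python_body = "2y|false|VOO"

# Body httr built from list(data = ...) with encode = "json",
# matching the verbose log (compact separators, no spaces).
httr_body = json.dumps({"data": "2y|false|VOO"}, separators=(",", ":"))

# Body set in the curl attempt via copypostfields.
curl_body = "data='2y|false|VOO'"

for label, body in [("python", python_body), ("httr", httr_body), ("curl", curl_body)]:
    print(f"{label:7s} {len(body.encode('utf-8')):3d} bytes  {body}")
```

Only the first body is the bare string `2y|false|VOO`. That would be consistent with the server falling back to its default interval for the other two, although the output above does not prove that this is the cause.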

1 answer:

Answer 0: (score: 0)

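Following a suggestion from Duncan Temple Lang, I have been able to download the data "as advertised". Here is the revised code, showing two ways of doing the download (i.e., httr and RCurl):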

library(httr)

setup_query <- function(symbol, time_interval) {

    base_url <- paste("http://www.nasdaq.com/symbol/",
                      tolower(symbol),
                      "/historical",
                      sep='')

    data_string <- paste(time_interval, "|false|", symbol, sep='')

    body <- c("data"=data_string)
    headers <- c("Content-Type" = "application/json")

    list(url=base_url, body=body, headers=headers)
}

get_table_data <- function(doc) {
    doc %>%
        read_html() %>%
        html_nodes("table")  %>%
        html_table(header=TRUE) %>%
        `[[`(1) %>%
        slice(2:n()) -> table_data

    names(table_data) <- sapply(names(table_data), function(name) {
        unlist(strsplit(name, '\r'))[1]})
    return(table_data)
}

query_params <- setup_query("VOO", "1y")

result_h <- POST(query_params$url,
                 add_headers(.headers=query_params$headers),
                 body = query_params$body,
                 encode="json",
                 verbose())

stop_for_status(result_h)


library(rvest)
library(dplyr)

table_data_h <- get_table_data(result_h)

head(table_data_h)
tail(table_data_h)


#### Duncan, 2017-03-07

library(RCurl)

query_params <- setup_query("VOO", "2y")

result_d = getURLContent(query_params$url,
                         postfields = as.character(query_params$body),
                         httpheader = query_params$headers,
                         customrequest = "POST")


table_data_d <- get_table_data(result_d)

head(table_data_d)
tail(table_data_d)

Here are the results of running the code today. Note that I chose two different time intervals to highlight the difference.

> head(table_data_h)
        Date   Open    High      Low Close / Last    Volume
1 03/09/2017 217.33 217.836 216.4602       217.41 1,634,713
2 03/08/2017 217.95 218.200 217.0700       217.18 1,784,348
3 03/07/2017 217.98 218.330 217.4200       217.64 1,742,746
4 03/06/2017 218.07 218.648 217.6300       218.30 1,530,987
5 03/03/2017 218.70 219.090 218.3080       218.98 1,765,450
6 03/02/2017 219.99 220.020 218.7500       218.86 1,650,637

> tail(table_data_h)
          Date   Open   High    Low Close / Last    Volume
248 03/16/2016 184.88 186.93 184.86       186.57 2,700,654
249 03/15/2016 184.70 185.46 184.42       185.46 1,325,297
250 03/14/2016 185.40 186.20 185.07       185.70 1,412,806
251 03/11/2016 184.57 186.01 184.46       185.98 2,231,094
252 03/10/2016 183.38 184.39 181.03       183.01 2,089,526
253 03/09/2016 182.85 183.22 182.01       182.91 1,641,079

> head(table_data_d)
        Date   Open    High      Low Close / Last    Volume
1 03/09/2017 217.33 217.836 216.4602       217.41 1,634,713
2 03/08/2017 217.95 218.200 217.0700       217.18 1,784,348
3 03/07/2017 217.98 218.330 217.4200       217.64 1,742,746
4 03/06/2017 218.07 218.648 217.6300       218.30 1,530,987
5 03/03/2017 218.70 219.090 218.3080       218.98 1,765,450
6 03/02/2017 219.99 220.020 218.7500       218.86 1,650,637

> tail(table_data_d)
          Date   Open    High    Low Close / Last    Volume
501 03/16/2015 189.59 191.380 189.55       191.38 1,054,150
502 03/13/2015 189.68 189.790 187.62       188.80 1,792,236
503 03/12/2015 188.21 190.000 188.20       189.96 2,313,157
504 03/11/2015 188.26 188.470 187.47       187.54 1,857,393
505 03/10/2015 189.57 189.632 187.97       187.98 2,390,401
506 03/09/2015 190.52 191.470 190.42       191.12 1,131,634
>