A colleague of mine has been working with financial data in Python. He sent me a snippet of Python code as an example:
import requests
import bs4
response = requests.post('http://www.nasdaq.com/symbol/voo/historical',
                         data='2y|false|VOO',
                         headers={'Content-Type': 'application/json'})
html = bs4.BeautifulSoup(response.text, 'lxml')
table_data = [[td.text.strip() for td in tr('td')]
              for tr in html('tr')][1:]
print table_data[0]
print table_data[-1]
Here is sample output from the Python code:
$ python scrape-nasdaq.py
[u'16:00', u'206.21', u'207.27', u'205.95', u'206.74', u'2,983,048']
[u'12/08/2014', u'190.35', u'190.83', u'188.85', u'189.5', u'1,527,709']
This should download two years of stock data for the Vanguard fund VOO (picked more or less at random, I gather), and it appears to do just that. (Note the "2014" string in the output.)
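The `data` payload appears to be a pipe-delimited string of the form `<period>|false|<symbol>`; that format (and the meaning of the middle `false` field) is only inferred from this one example, not documented. A small helper for building it might look like:

```python
def build_query_payload(symbol, period):
    """Build the pipe-delimited payload the historical-data page expects.

    The '<period>|false|<symbol>' layout is inferred from the example
    request; the middle 'false' field's meaning is undocumented.
    """
    return "{}|false|{}".format(period, symbol.upper())

print(build_query_payload("voo", "2y"))  # -> 2y|false|VOO
```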
I've had a passing interest in web scraping for some time, but never had occasion to take it seriously. I took my colleague's code as a challenge and decided to try to mimic the Python code using the httr package (together with rvest, which, IIUC, was partly inspired by the BeautifulSoup Python package).
I tried to mimic the Python code as closely as I could, but I cannot get my code to download the data. A fuller description: sometimes I can download some data, namely the site's default 3-month listing, but I cannot get the site to honor my request for two years of data. At other times, I just get an error response:
> stop_for_status(result)
Error: Service Unavailable (HTTP 503).
On "good" runs of the code I get a "200" response (but, as I said, do not get all the requested data). I have no idea why it works (at all) sometimes and fails at other times.
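Intermittent 503s like this usually indicate server-side throttling rather than a client bug, and the usual mitigation (which would not explain the missing two-year data, only the flaky failures) is to retry with exponential backoff. A minimal sketch, independent of any particular HTTP library:

```python
import time

def post_with_retry(do_post, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call do_post() until it returns a response with status < 500,
    sleeping base_delay * 2**attempt seconds between failed attempts.

    do_post is any zero-argument callable returning an object with a
    .status_code attribute, e.g.
        lambda: requests.post(url, data=payload, headers=headers)
    The last response is returned even if every attempt failed, so the
    caller can still inspect the 503.
    """
    response = None
    for attempt in range(max_tries):
        response = do_post()
        if response.status_code < 500:
            return response
        if attempt < max_tries - 1:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return response
```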
Here are the R code and results (for a case where I got some data):
library(httr)
library(dplyr)
library(rvest)
base_url <- "http://www.nasdaq.com/symbol/voo/historical"
body <- list(data="2y|false|VOO")
headers <- '"Content-Type"="application/json"'
result <- POST(base_url,
               add_headers(headers),
               body = body,
               encode="json",
               verbose())
stop_for_status(result)
table_data <- content(result) %>%
    html_nodes("table") %>%
    html_table(header=TRUE) %>%
    `[[`(1) %>%
    slice(2:n())
names(table_data) <- sapply(names(table_data), function(name) {
    unlist(strsplit(name, '\r'))[1]})
names(table_data)
head(table_data)
tail(table_data)
> head(table_data)
Date Open High Low Close / Last Volume
1 12/07/2016 203.45 206.320 203.300 206.20 2,253,230
2 12/06/2016 203.17 203.630 202.640 203.62 2,412,897
3 12/05/2016 202.64 203.300 202.415 202.91 2,070,675
4 12/02/2016 201.70 202.230 201.350 201.75 2,119,016
5 12/01/2016 202.68 202.710 201.240 201.61 3,281,407
6 11/30/2016 203.53 203.692 202.310 202.40 2,359,018
> tail(table_data)
Date Open High Low Close / Last Volume
60 09/14/2016 194.83 196.1199 194.12 194.66 2,966,319
61 09/13/2016 196.25 196.5100 194.12 194.78 3,361,848
62 09/12/2016 194.88 198.9500 194.80 198.48 2,800,160
63 09/09/2016 199.08 199.1800 195.67 195.68 3,430,638
64 09/08/2016 200.62 200.9199 200.19 200.52 2,180,474
65 09/07/2016 200.82 201.1500 200.32 200.99 1,455,442
>
As you can see from the head/tail above, all the results are confined to the current day and the preceding three months.
Suggestions welcome. I've appended the verbose output of the POST command, as well as my session info.
- Michael
-> POST /symbol/voo/historical HTTP/1.1
-> User-Agent: libcurl/7.35.0 r-curl/2.3 httr/1.2.1
-> Host: www.nasdaq.com
-> Accept-Encoding: gzip, deflate
-> Accept: application/json, text/xml, application/xml, */*
-> Content-Type: application/json
-> Content-Length: 23
->
>> {"data":"2y|false|VOO"}
<- HTTP/1.1 503 Service Unavailable
<- Content-Type: text/html; charset=us-ascii
<- Server: Microsoft-HTTPAPI/2.0
<- Content-Length: 326
<- Expires: Fri, 09 Dec 2016 01:58:03 GMT
<- Cache-Control: max-age=0, no-cache, no-store
<- Pragma: no-cache
<- Date: Fri, 09 Dec 2016 01:58:03 GMT
<- Connection: close
<- Set-Cookie: NSC_W.TJUFEFGFOEFS.OBTEBR.80=ffffffffc3a08e3045525d5f4f58455e445a4a423660;expires=Fri, 09-Dec-2016 02:08:03 GMT;path=/;httponly
<-
> stop_for_status(result)
Error: Service Unavailable (HTTP 503).
>
> session_info()
Session info
-----------------------------------------------------------------------------------
setting value
version R version 3.3.2 (2016-10-31)
system x86_64, linux-gnu
ui X11
language en_US
collate en_US.UTF-8
tz <NA>
date 2016-12-08
Packages
-------------------------------------------------------------
package * version date source
assertthat 0.1 2013-12-06 CRAN (R 3.2.1)
colorspace 1.3-1 2016-11-18 CRAN (R 3.3.2)
curl 2.3 2016-11-24 CRAN (R 3.3.2)
DBI 0.5-1 2016-09-10 CRAN (R 3.3.1)
devtools * 1.12.0 2016-06-24 CRAN (R 3.3.1)
digest 0.6.10 2016-08-02 CRAN (R 3.3.1)
dplyr * 0.5.0 2016-06-24 CRAN (R 3.3.1)
ggplot2 * 2.2.0 2016-11-11 CRAN (R 3.3.2)
gtable 0.2.0 2016-02-26 CRAN (R 3.2.3)
hms 0.3 2016-11-22 CRAN (R 3.3.2)
httr * 1.2.1 2016-07-03 CRAN (R 3.3.1)
jsonlite 1.1 2016-09-14 CRAN (R 3.3.1)
lazyeval 0.2.0 2016-06-12 CRAN (R 3.3.0)
magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
memoise 1.0.0 2016-01-29 CRAN (R 3.2.3)
munsell 0.4.3 2016-02-13 CRAN (R 3.2.3)
plyr 1.8.4 2016-06-08 CRAN (R 3.3.0)
purrr * 0.2.2 2016-06-18 CRAN (R 3.3.0)
R6 2.2.0 2016-10-05 CRAN (R 3.3.1)
Rcpp 0.12.8 2016-11-17 CRAN (R 3.3.2)
readr * 1.0.0.9000 2016-11-01 Github (tidyverse/readr@b8c3ddb)
rvest * 0.3.2 2016-06-17 CRAN (R 3.3.0)
scales 0.4.1 2016-11-09 CRAN (R 3.3.2)
tibble * 1.2 2016-08-26 CRAN (R 3.3.1)
tidyr * 0.6.0 2016-08-12 CRAN (R 3.3.1)
tidyverse * 1.0.0 2016-09-09 CRAN (R 3.3.1)
withr 1.0.2 2016-06-20 CRAN (R 3.3.1)
xml2 * 1.0.0 2016-06-24 CRAN (R 3.3.1)
>
Thanks, Hadley. I wasn't sure how to make the comparison you suggested, but here is my cut at it.
First, from R:
result_text <- content(result, "text")
result_text <- unlist(strsplit(result_text, "\r\n"))
result_text[1:10]
> result_text[1:10]
 [1] "<div id=\"quotes_content_left_pnlAJAX\">" "\t"
 [3] " <h3 class=\"table-headtag\">"            " Results for: 3 Month, From "
 [5] "09-SEP-2016 TO 09-DEC-2016 "              " </h3>"
 [7] " <table>"                                 " <thead>"
 [9] " <tr>"                                    " <th>Date</th>"
现在是Python:
In [21]: response.text[0:255]
Out[21]: u'<div id="quotes_content_left_pnlAJAX">\r\n\t\r\n <h3 class="table-headtag">\r\n Results for: 2 Years, From \r\n09-DEC-2014 TO 09-DEC-2016 \r\n </h3>\r\n <table>\r\n <thead>\r\n <tr>\r\n '
As far as I can tell, these are identical except for the time interval. My only thought is that I am somehow mangling the syntax of the "body" argument, so that my time interval is not conveyed to the server.
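One more comparison worth making is of the request bodies themselves, not just the responses. The verbose httr output above shows a 23-byte JSON object, `{"data":"2y|false|VOO"}`, while the Python example passes `data='2y|false|VOO'`, which requests sends as the raw 12-byte string. So despite the matching Content-Type header, the two clients do not put identical bytes on the wire. A Python sketch reproducing both bodies:

```python
import json

payload = "2y|false|VOO"

# What the Python example sends: requests' data= argument passes the
# string through unmodified, so the wire body is these 12 bytes.
raw_body = payload

# What httr sends for body = list(data = payload) with encode = "json":
# the named list is serialized to a JSON object, 23 bytes -- exactly the
# Content-Length shown in the verbose output above.
json_body = json.dumps({"data": payload}, separators=(",", ":"))

print(len(raw_body), raw_body)
print(len(json_body), json_body)
```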
curl
No doubt unsurprisingly, I get the same result (only the default three months of data) from the following curl code:
library(curl)
h <- new_handle()
handle_setopt(h, copypostfields = "data='2y|false|VOO'")
handle_setheaders(h,
                  "Content-Type" = "application/json")
result <- curl_fetch_memory(base_url, handle = h)
Answer (score: 0)
Following a suggestion from Duncan Temple Lang, I have been able to download the data "as advertised". Here is the revised code, showing both ways to download (i.e., httr and RCurl):
library(httr)

setup_query <- function(symbol, time_interval) {
    base_url <- paste("http://www.nasdaq.com/symbol/",
                      tolower(symbol),
                      "/historical",
                      sep='')
    data_string <- paste(time_interval, "|false|", symbol, sep='')
    body <- c("data"=data_string)
    headers <- c("Content-Type" = "application/json")
    list(url=base_url, body=body, headers=headers)
}

get_table_data <- function(doc) {
    doc %>%
        read_html() %>%
        html_nodes("table") %>%
        html_table(header=TRUE) %>%
        `[[`(1) %>%
        slice(2:n()) -> table_data
    names(table_data) <- sapply(names(table_data), function(name) {
        unlist(strsplit(name, '\r'))[1]})
    return(table_data)
}
query_params <- setup_query("VOO", "1y")
result_h <- POST(query_params$url,
                 add_headers(.headers=query_params$headers),
                 body = query_params$body,
                 encode="json",
                 verbose())
stop_for_status(result_h)
library(rvest)
library(dplyr)
table_data_h <- get_table_data(result_h)
head(table_data_h)
tail(table_data_h)
#### Duncan, 2017-03-07
library(RCurl)
query_params <- setup_query("VOO", "2y")
result_d = getURLContent(query_params$url,
                         postfields = as.character(query_params$body),
                         httpheader = query_params$headers,
                         customrequest = "POST")
table_data_d <- get_table_data(result_d)
head(table_data_d)
tail(table_data_d)
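The `strsplit(name, '\r')` step in `get_table_data` is there because the scraped `<th>` text carries trailing carriage returns and whitespace (the exact raw header strings below are an assumption for illustration). The same cleanup in Python would be:

```python
def clean_header(name):
    """Keep only the text before the first carriage return, mirroring
    unlist(strsplit(name, '\r'))[1] in the R code. Headers without a
    carriage return pass through unchanged."""
    return name.split("\r")[0]

# Hypothetical raw headers of the kind the split guards against:
raw = ["Date\r\n ", "Close / Last\r\n "]
print([clean_header(h) for h in raw])  # -> ['Date', 'Close / Last']
```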
Here are the results from running the code today. Note that I picked two different time intervals to highlight the difference.
> head(table_data_h)
Date Open High Low Close / Last Volume
1 03/09/2017 217.33 217.836 216.4602 217.41 1,634,713
2 03/08/2017 217.95 218.200 217.0700 217.18 1,784,348
3 03/07/2017 217.98 218.330 217.4200 217.64 1,742,746
4 03/06/2017 218.07 218.648 217.6300 218.30 1,530,987
5 03/03/2017 218.70 219.090 218.3080 218.98 1,765,450
6 03/02/2017 219.99 220.020 218.7500 218.86 1,650,637
> tail(table_data_h)
Date Open High Low Close / Last Volume
248 03/16/2016 184.88 186.93 184.86 186.57 2,700,654
249 03/15/2016 184.70 185.46 184.42 185.46 1,325,297
250 03/14/2016 185.40 186.20 185.07 185.70 1,412,806
251 03/11/2016 184.57 186.01 184.46 185.98 2,231,094
252 03/10/2016 183.38 184.39 181.03 183.01 2,089,526
253 03/09/2016 182.85 183.22 182.01 182.91 1,641,079
> head(table_data_d)
Date Open High Low Close / Last Volume
1 03/09/2017 217.33 217.836 216.4602 217.41 1,634,713
2 03/08/2017 217.95 218.200 217.0700 217.18 1,784,348
3 03/07/2017 217.98 218.330 217.4200 217.64 1,742,746
4 03/06/2017 218.07 218.648 217.6300 218.30 1,530,987
5 03/03/2017 218.70 219.090 218.3080 218.98 1,765,450
6 03/02/2017 219.99 220.020 218.7500 218.86 1,650,637
> tail(table_data_d)
Date Open High Low Close / Last Volume
501 03/16/2015 189.59 191.380 189.55 191.38 1,054,150
502 03/13/2015 189.68 189.790 187.62 188.80 1,792,236
503 03/12/2015 188.21 190.000 188.20 189.96 2,313,157
504 03/11/2015 188.26 188.470 187.47 187.54 1,857,393
505 03/10/2015 189.57 189.632 187.97 187.98 2,390,401
506 03/09/2015 190.52 191.470 190.42 191.12 1,131,634
>
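One last note: the scraped table still holds strings, with MM/DD/YYYY dates and comma-grouped volumes. A minimal Python sketch for converting one row into typed values, assuming the column order shown in the output above:

```python
from datetime import datetime

def parse_row(row):
    """Type one scraped row [date, open, high, low, close, volume],
    given as strings in the layout shown above."""
    date = datetime.strptime(row[0], "%m/%d/%Y").date()
    open_, high, low, close = (float(x) for x in row[1:5])
    volume = int(row[5].replace(",", ""))  # strip thousands separators
    return date, open_, high, low, close, volume

row = ["03/09/2017", "217.33", "217.836", "216.4602", "217.41", "1,634,713"]
print(parse_row(row))
```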