How to fetch a URL over HTTPS

Asked: 2017-11-10 04:15:54

Tags: curl https web-crawler wget

I want to download "https://www.luisaviaroma.com/en-us/shop/home" with wget. I tried wget --no-cookie --no-check-certificate https://www.luisaviaroma.com/en-us/shop/home, but it prints HTTP request sent, awaiting response... and no response ever arrives. How can I download the page with Wget?

2 answers:

Answer 0: (score: 1)

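I think the main difference here is the User-Agent header. It looks like this host rejects Wget's default User-Agent, so you can send the same headers a browser would. I copied mine from Chrome: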

wget https://www.luisaviaroma.com/en-us/shop/home --header="Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
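
The question describes Wget waiting with no response, so it may also help to bound how long it waits. The following is a minimal sketch, assuming a POSIX shell; the function name fetch_as_browser, the variable BROWSER_UA, and the 20-second and 2-try limits are illustrative choices, not part of the answer:

```shell
# Hypothetical helper: fetch a URL with Wget while presenting a browser
# User-Agent, and give up instead of waiting on a silent server.

# User-Agent string copied from the wget command in this answer.
BROWSER_UA='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

fetch_as_browser() {
    url=$1
    outfile=$2
    # --timeout caps the DNS, connect and read waits (in seconds);
    # --tries limits how many attempts Wget makes before failing.
    wget --timeout=20 --tries=2 \
         --user-agent="$BROWSER_UA" \
         --output-document="$outfile" \
         "$url"
}
```

It could be called as fetch_as_browser 'https://www.luisaviaroma.com/en-us/shop/home' home.html. If the server still does not answer, Wget exits with a non-zero status after the timeouts instead of appearing to hang.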

Answer 1: (score: 0)

The page at this URL seems to be reachable only over HTTP/2, for example with cURL:

curl -v 'https://www.luisaviaroma.com/en-us/shop/home' \
     -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

which gives:

* Using HTTP/2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55d1310c6da0)
> GET /en-us/shop/home HTTP/1.1
> Host: www.luisaviaroma.com
> Accept: */*
> User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 200
< content-type: text/html; charset=utf-8
< server: Microsoft-IIS/8.5
< x-aspnet-version: 4.0.30319
< x-powered-by: ASP.NET
< access-control-allow-origin: *
< x-akamai-transformed: 9 - 0 pmb=mTOE,2
< expires: Mon, 13 Nov 2017 21:34:13 GMT
< cache-control: max-age=0, no-cache, no-store
< pragma: no-cache
< date: Mon, 13 Nov 2017 21:34:13 GMT

But when HTTP 1.0 or HTTP 1.1 is forced,

curl -v 'https://www.luisaviaroma.com/en-us/shop/home' \
     -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' \
     --http1.0

it gives a different result:

> GET /en-us/shop/home HTTP/1.0
> Host: www.luisaviaroma.com
> Accept: */*
> User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Server: Apache
< ETag: "9c027fecce6719909433b8a37b2b403a:1493300815"
< Last-Modified: Thu, 27 Apr 2017 13:46:55 GMT
< Accept-Ranges: bytes
< Content-Length: 31186
< Content-Type: text/html
< Expires: Mon, 13 Nov 2017 21:36:19 GMT
< Cache-Control: max-age=0, no-cache, no-store
< Pragma: no-cache
< Date: Mon, 13 Nov 2017 21:36:19 GMT
< Connection: close

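With HTTP 1.0 or HTTP 1.1 we hit a different server (Apache rather than Microsoft-IIS), which serves a different page, whereas HTTP/2 returns the expected page.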
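
To repeat this comparison without reading the full verbose output, something like the sketch below might be used. It assumes a POSIX shell and a curl recent enough to support the %{http_version} write-out variable; the function name probe_http_version and the variable UA are hypothetical:

```shell
# Hypothetical helper: print the HTTP version and status code curl ends up
# with for a URL. Extra curl flags (e.g. --http1.1 or --http2) follow the URL.

# User-Agent string copied from the curl commands in this answer.
UA='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

probe_http_version() {
    url=$1
    shift
    # The body is discarded; only the protocol version and status are printed.
    # --max-time stops a stalled request after 20 seconds.
    curl --silent --output /dev/null --max-time 20 \
         --user-agent "$UA" \
         --write-out '%{http_version} %{http_code}\n' \
         "$@" "$url"
}
```

Running probe_http_version 'https://www.luisaviaroma.com/en-us/shop/home' --http1.1 and then the same call with --http2 shows which protocol version each request actually used, which makes it easier to tell whether the two requests are being handled differently.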

With a cURL build that supports HTTP/2, you can download this page, for example saving it as homepage.html:

curl 'https://www.luisaviaroma.com/en-us/shop/home' \
     -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' \
     -o homepage.html
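
Whether a given curl build supports HTTP/2 can be checked from its version output. This is a small sketch, assuming a POSIX shell; the function name curl_has_http2 is hypothetical:

```shell
# Hypothetical helper: succeed if the installed curl lists HTTP2 among its
# compiled-in features (the "Features:" line of `curl --version`).
curl_has_http2() {
    curl --version | grep -q '^Features:.*HTTP2'
}
```

For example, curl_has_http2 || echo 'this curl build lacks HTTP/2 support' prints a warning before the download is attempted with a build that can only speak HTTP 1.x.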