Question

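I am trying to open an HTML page with the Python requests library, but my code opens the site's root page instead, and I don't understand how to fix it.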

import requests

scraping = requests.request("POST", url = "http://www.pollnet.it/WeeklyReport_it.aspx?ID=69")

print scraping.content

Thanks for any suggestions!

Answer 1

You can easily see that the server is redirecting to the main page.

➜  ~  http -v http://www.pollnet.it/WeeklyReport_it.aspx\?ID\=69
GET /WeeklyReport_it.aspx?ID=69 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.pollnet.it
User-Agent: HTTPie/0.9.3



HTTP/1.1 302 Found
Content-Length: 131
Content-Type: text/html; charset=utf-8
Date: Sun, 07 Feb 2016 11:24:52 GMT
Location: /default.asp
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET

<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="%2fdefault.asp">here</a>.</h2>
</body></html>
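
The same redirect can also be checked from Python. The sketch below is a minimal helper, not part of the original answer: the function name `redirect_chain` is hypothetical, and it relies only on the `history`, `status_code` and `headers` attributes that a requests response provides.

```python
def redirect_chain(response):
    """Return (status_code, Location) for every redirect that was followed.

    `response` is expected to be a requests response; its `history`
    attribute lists the intermediate redirect responses, oldest first.
    """
    return [
        (hop.status_code, hop.headers.get("Location"))
        for hop in response.history
    ]


# Usage with the real library (needs network access):
#   import requests
#   r = requests.get("http://www.pollnet.it/WeeklyReport_it.aspx?ID=69")
#   print(redirect_chain(r))   # a non-empty list means we were redirected
```

An empty list means the server answered the requested URL directly.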

On further inspection, you can see that the web server uses session cookies.

➜  ~  http -v http://www.pollnet.it/default_it.asp

HTTP/1.1 200 OK
Cache-Control: private
Content-Encoding: gzip
Content-Length: 9471
Content-Type: text/html; Charset=utf-8
Date: Sun, 07 Feb 2016 13:21:41 GMT
Server: Microsoft-IIS/7.5
Set-Cookie: ASPSESSIONIDSQTSTAST=PBHDLEIDFCNMPKIGANFDNMLK; path=/
Vary: Accept-Encoding
X-Powered-By: ASP.NET

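This means that every time you visit the main page, the server sends a "Set-Cookie" header, which instructs the browser to set a cookie. Then, each time the browser requests a weekly report, the server validates the session cookie.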
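
To make the mechanism concrete, here is a small sketch of what a client does with that header, using only the standard library. The helper name `cookie_header_from_set_cookie` is made up for illustration; the header value is the one from the response above.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value):
    """Turn a Set-Cookie header value into the Cookie header sent back."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)  # attributes such as path=/ are parsed, not echoed
    return "; ".join("{}={}".format(m.key, m.value) for m in jar.values())


header = cookie_header_from_set_cookie(
    "ASPSESSIONIDSQTSTAST=PBHDLEIDFCNMPKIGANFDNMLK; path=/"
)
print(header)
```

A request that lacks this Cookie header carries no session, which would explain the redirect to the main page.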

Normally, the requests package does not save cookies between requests, but for scraping we can use a Session object to keep cookies across page requests.

import requests

# create a Session object
s= requests.Session()

# first visit the main page
s.get("http://www.pollnet.it/default_it.asp")

# then we can visit the weekly report pages
r = s.get("http://www.pollnet.it/WeeklyReport_it.aspx?ID=69")

print(r.text)

# another page
r = s.get("http://www.pollnet.it/WeeklyReport_it.aspx?ID=89")
print(r.text)

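One word of advice, though: the web server may allow only a fixed number of pages to be opened with a given Session object (maybe 10, maybe 15). Either check the result of r.text immediately each time (for example, check the length of the response body to make sure it is not too small), or create a new Session object every 5 or 6 pages.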
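
A rough sketch of that advice follows. Everything in it is an assumption rather than something the site documents: the function name `fetch_reports`, the 5-page rotation, and the 500-character threshold are illustrative values only. The session is passed in as a factory so that `requests.Session` can be used directly.

```python
def fetch_reports(urls, make_session, home_url,
                  pages_per_session=5, min_length=500):
    """Fetch each URL, starting a fresh session every few pages.

    make_session: callable returning a session object, e.g. requests.Session
    Returns a dict mapping each URL to its body, or None if it looks wrong.
    """
    results = {}
    session = None
    for index, url in enumerate(urls):
        # Every `pages_per_session` pages, start over with a new session
        # and visit the home page first so the server issues a new cookie.
        if index % pages_per_session == 0:
            session = make_session()
            session.get(home_url)
        response = session.get(url)
        body = response.text
        # A followed redirect or a suspiciously short body suggests the
        # session was rejected (the threshold is a guess; tune it).
        if response.history or len(body) < min_length:
            results[url] = None
        else:
            results[url] = body
    return results
```

With the real library this could be called as `fetch_reports(urls, requests.Session, "http://www.pollnet.it/default_it.asp")`, after which any URL mapped to None can be retried.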

For more on Session objects, see the requests documentation.

Python requests library opens the wrong page
