Question

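Using my Firefox browser, I log in to a download site and click one of its query buttons. A small window titled "Opening report1.csv" pops up, offering "Open with" or "Save file". I save the file.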

For this action, Live HTTP headers shows me:

https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR

GET /ReportPage?download&NAME=ALL&DATE=THISYEAR HTTP/1.1
  Host: myserver
  User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
  Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
  Accept-Language: en-US,en;q=0.8,de-DE;q=0.5,de;q=0.3
  Accept-Encoding: gzip, deflate, br
  Referer: https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR
  Cookie: JSESSIONID=88DEDBC6880571FDB0E6E4112D71B7D6
  Connection: keep-alive
  Upgrade-Insecure-Requests: 1

HTTP/1.1 200 OK
  Date: Sat, 30 Dec 2017 22:37:40 GMT
  Server: Apache-Coyote/1.1
  Last-Modified: Sat, 30 Dec 2017 22:37:40 GMT
  Expires: Thu, 01 Jan 1970 00:00:00 GMT
  Pragma: no-cache
  Cache-Control: no-cache, no-store
  Content-Disposition: attachment; filename="report1.csv"; filename*=UTF-8''report1.csv
  Content-Type: text/csv
  Content-Length: 332369
  Keep-Alive: timeout=5, max=100
  Connection: Keep-Alive

Now I try to simulate this with requests.

$ python3
>>> import requests
>>> from lxml import html
>>>
>>> s = requests.Session()
>>> s.verify = './myserver.crt'  # certificate of myserver for https
>>>
>>> # get the login web page to enter username and password
... r = s.get( 'https://myserver' )
>>>
>>> # Get url for logging in. It's the action-attribute in the form anywhere.
... # We use xpath.
... tree = html.fromstring(r.text)
>>> loginUrl = 'https://myserver/' + list(tree.xpath("//form[@id='id4']/@action"))[0]
>>> print( loginUrl )   # it contains a session-id
https://myserver/./;jsessionid=77EA70CB95252426439097E274286966?0-1.loginForm
>>>
>>> # logging in with username and password
... r = s.post( loginUrl, data = {'username':'ingo','password':'mypassword'} )
>>> print( r.status_code )
200
>>> # try to get the download file using url from Live HTTP headers
... downloadQueryUrl = 'https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR'
>>> r = s.get( downloadQueryUrl )
>>> print( r.status_code)
200
>>> print( r. headers )
{'Connection': 'Keep-Alive',
'Date': 'Sun, 31 Dec 2017 14:46:03 GMT',
'Cache-Control': 'no-cache, no-store',
'Keep-Alive': 'timeout=5, max=94',
'Transfer-Encoding': 'chunked',
'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT',
'Pragma': 'no-cache',
'Content-Encoding': 'gzip',
'Content-Type': 'text/html;charset=UTF-8',
'Server': 'Apache-Coyote/1.1',
'Vary': 'Accept-Encoding'}
>>> print( r.url )
https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR
>>>

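The request succeeds, but I don't get the file download. There is no "Content-Disposition: attachment;" entry in the headers. I only get the page from which the query starts, i.e. the page named in the Referer.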
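One way to narrow this down is to compare what was requested with what came back. In the session above, r.url ends in ReportPage?4&... although ReportPage?download&... was requested, which suggests the server answered with a redirect back to the report page. The sketch below summarises those signals. summarize_response is a hypothetical helper name, and the SimpleNamespace object only stands in for a requests.Response so the snippet runs without a server.

```python
from types import SimpleNamespace


def summarize_response(requested_url, response):
    """Report whether a response looks like the expected file download.

    `response` only needs .url, .history and .headers, the subset of
    requests.Response that is used here.
    """
    disposition = response.headers.get("Content-Disposition", "")
    return {
        # A non-empty history or a different final URL points to a redirect.
        "redirected": bool(response.history) or response.url != requested_url,
        "is_attachment": disposition.strip().lower().startswith("attachment"),
        "content_type": response.headers.get("Content-Type", ""),
    }


# Stand-in built from the values printed in the session above.
fake_response = SimpleNamespace(
    url="https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR",
    history=[],
    headers={"Content-Type": "text/html;charset=UTF-8"},
)
summary = summarize_response(
    "https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR",
    fake_response,
)
print(summary)
```

With a real session, the same function could be called as summarize_response(downloadQueryUrl, r) right after s.get().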

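Does this have something to do with the session cookie? requests seems to manage that automatically. Is there special handling for CSV files? Do I have to use streams? Is the download URL shown by Live HTTP headers the right one? Maybe it is created dynamically?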

How can I get the response with "Content-Disposition: attachment;" from myserver and download its file with requests?

Answer 1

I got it. @Patrick Mevzek pointed me in the right direction. Thank you.

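After logging in, I no longer stay on the first landing page and call the query from there. Instead I request the report page, extract the query URL from it, and request that URL. Now I get a response with "Content-Disposition: attachment;" in its headers. Printing its text to stdout is then simple. I prefer this because I can redirect the output to any file. Informational messages go to stderr, so they don't clutter the redirected output. A typical call is ./download >out.csv.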
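Links extracted from a page are often relative to that page. As a minimal sketch, assuming the href of the download link is relative, urllib.parse.urljoin from the standard library can resolve it against the URL of the page it came from instead of gluing strings together. resolve_link is a hypothetical helper name and the href value below is made up for illustration; the real one comes from the report page.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


page_url = "https://myserver/ReportPage?NAME=ALL&DATE=THISYEAR"
# Hypothetical href; the real value is read from the page with xpath.
href = "./ReportPage?download&NAME=ALL&DATE=THISYEAR"
download_url = resolve_link(page_url, href)
print(download_url)
```

With requests, r.url holds the final URL of the page, so resolve_link(r.url, href) would also handle the case where the page was reached through a redirect.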

For completeness, here is a template of the script, without any error checking, to show how it works.

#!/usr/bin/python3

import requests
import sys
from lxml import html

s = requests.Session()
s.verify = './myserver.crt'  # certificate of myserver for https

# get the login web site to enter username and password
r = s.get( 'https://myserver' )

# Get url for logging in. It's the action-attribute in the form anywhere.
# We use xpath.
tree = html.fromstring(r.text)
loginUrl = 'https://myserver/' + tree.xpath("//form[@id='id4']/@action")[0]

# logging in with username and password and go to ReportPage with queries
r = s.post( loginUrl, data = {'username':'ingo','password':'mypassword'} )
queryUrl = 'https://myserver/ReportPage?NAME=ALL&DATE=THISYEAR'
r = s.get( queryUrl )

# Get the download link for this query from this site. It's a link anywhere
# with value 'Download (UTF8)'
tree = html.fromstring( r.text )
downloadUrl = 'https://myserver/' + tree.xpath("//a[.='Download (UTF8)']/@href")[0]

# get the download file
r = s.get( downloadUrl )
if r.headers.get('Content-Disposition'):
    print( 'Downloading ...', file=sys.stderr )
    print( r.text )

# log out
r = s.get( 'https://myserver/logout' )
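The script prints r.text, which decodes the body and appends a newline. Where the file has to arrive byte for byte, a variant could read the file name from the Content-Disposition header and copy the raw bytes in chunks. The following is only a sketch under that assumption. attachment_filename and write_chunks are hypothetical helper names, and the header parsing borrows email.message.Message from the standard library.

```python
import io
from email.message import Message


def attachment_filename(headers):
    """Return the filename of a Content-Disposition: attachment header, or None."""
    value = headers.get("Content-Disposition")
    if not value:
        return None
    msg = Message()
    msg["Content-Disposition"] = value
    if msg.get_content_disposition() != "attachment":
        return None
    return msg.get_filename()


def write_chunks(chunks, out):
    """Write an iterable of byte chunks to a binary stream; return bytes written."""
    total = 0
    for chunk in chunks:
        if chunk:  # skip empty chunks
            out.write(chunk)
            total += len(chunk)
    return total


# Header value taken from the Live HTTP headers capture in the question.
headers = {
    "Content-Disposition":
        "attachment; filename=\"report1.csv\"; filename*=UTF-8''report1.csv"
}
print(attachment_filename(headers))

# In-memory stand-in for the download body and the output stream.
buffer = io.BytesIO()
written = write_chunks([b"a;b\n", b"", b"1;2\n"], buffer)
print(written)
```

With requests, the chunks could come from r.iter_content(chunk_size=8192) on a response fetched with s.get(downloadUrl, stream=True), and out could be sys.stdout.buffer so that ./download >out.csv still works.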

Can't get web page with "Content-Disposition: attachment;" using python-requests
