Can't get a page with "Content-Disposition: attachment;" using python-requests

Date: 2017-12-31 16:00:17

Tags: python python-requests

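I logged in to a download site with my Firefox browser and clicked one of the query buttons. A small window titled "Opening report1.csv" popped up, where I can choose "Open with" or "Save File". I saved the file.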

For this action, Live HTTP headers shows me:

  

https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR

GET /ReportPage?download&NAME=ALL&DATE=THISYEAR HTTP/1.1
  Host: myserver
  User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
  Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
  Accept-Language: en-US,en;q=0.8,de-DE;q=0.5,de;q=0.3
  Accept-Encoding: gzip, deflate, br
  Referer: https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR
  Cookie: JSESSIONID=88DEDBC6880571FDB0E6E4112D71B7D6
  Connection: keep-alive
  Upgrade-Insecure-Requests: 1

HTTP/1.1 200 OK
  Date: Sat, 30 Dec 2017 22:37:40 GMT
  Server: Apache-Coyote/1.1
  Last-Modified: Sat, 30 Dec 2017 22:37:40 GMT
  Expires: Thu, 01 Jan 1970 00:00:00 GMT
  Pragma: no-cache
  Cache-Control: no-cache, no-store
  Content-Disposition: attachment; filename="report1.csv"; filename*=UTF-8''report1.csv
  Content-Type: text/csv
  Content-Length: 332369
  Keep-Alive: timeout=5, max=100
  Connection: Keep-Alive

Now I try to reproduce this with requests.

$ python3
>>> import requests
>>> from lxml import html
>>>
>>> s = requests.Session()
>>> s.verify = './myserver.crt'  # certificate of myserver for https
>>>
>>> # get the login web page to enter username and password
... r = s.get( 'https://myserver' )
>>>
>>> # Get url for logging in. It's the action-attribute in the form anywhere.
... # We use xpath.
... tree = html.fromstring(r.text)
>>> loginUrl = 'https://myserver/' + list(tree.xpath("//form[@id='id4']/@action"))[0]
>>> print( loginUrl )   # it contains a session-id
https://myserver/./;jsessionid=77EA70CB95252426439097E274286966?0-1.loginForm
>>>
>>> # logging in with username and password
... r = s.post( loginUrl, data = {'username':'ingo','password':'mypassword'} )
>>> print( r.status_code )
200
>>> # try to get the download file using url from Live HTTP headers
... downloadQueryUrl = 'https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR'
>>> r = s.get( downloadQueryUrl )
>>> print( r.status_code)
200
>>> print( r.headers )
{'Connection': 'Keep-Alive',
'Date': 'Sun, 31 Dec 2017 14:46:03 GMT',
'Cache-Control': 'no-cache, no-store',
'Keep-Alive': 'timeout=5, max=94',
'Transfer-Encoding': 'chunked',
'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT',
'Pragma': 'no-cache',
'Content-Encoding': 'gzip',
'Content-Type': 'text/html;charset=UTF-8',
'Server': 'Apache-Coyote/1.1',
'Vary': 'Accept-Encoding'}
>>> print( r.url )
https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR
>>>

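The request succeeds, but I don't get the file download. There is no "Content-Disposition: attachment;" entry in the headers. I only get the page where the query starts, i.e. the page named in the Referer header.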
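A quick way to see which kind of response came back is to compare the requested URL with r.url and to look for the Content-Disposition header. The sketch below does that on plain values, so it runs without a server. The function name classify_response and its three labels are made up for illustration.

```python
from typing import Mapping


def classify_response(requested_url: str, final_url: str,
                      headers: Mapping[str, str]) -> str:
    """Return 'attachment', 'redirected' or 'inline' for a response.

    Hypothetical helper: the labels are illustrative, not part of requests.
    """
    # Header names are case-insensitive, so normalize them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    disposition = lowered.get("content-disposition", "")
    if disposition.strip().lower().startswith("attachment"):
        return "attachment"
    # A different final URL means the server sent us somewhere else.
    if final_url != requested_url:
        return "redirected"
    return "inline"


if __name__ == "__main__":
    # Values taken from the session shown above.
    print(classify_response(
        "https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR",
        "https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR",
        {"Content-Type": "text/html;charset=UTF-8"},
    ))
```

With a real response this would be called as classify_response(downloadQueryUrl, r.url, r.headers). When a redirect happened, r.history holds the intermediate responses.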

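Is this related to the session cookie? requests seems to manage that automatically. Is there special handling for csv files? Do I have to use streaming? Is the download URL shown by Live HTTP headers correct? Maybe it is created dynamically?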

How can I get the response with "Content-Disposition: attachment;" from myserver and download its file using requests?

1 Answer:

Answer 0: (score: 1)

I got it. @Patrick Mevzek pointed me in the right direction. Thank you.

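After logging in, I no longer stay on the first page after login and call the query from there. Instead, I request the report page, extract the download URL for the query from it, and request that URL. Now I get a response with "Content-Disposition: attachment;" in its headers. Printing its text to stdout is then simple. I prefer this because I can redirect the output to any file. Informational messages go to stderr, so they don't clutter the redirected output. A typical call is ./download >out.csv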
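The extraction step can be sketched in isolation as follows. This is only an illustration under assumptions: it uses the standard library's html.parser instead of lxml so it runs without extra packages, the sample markup is invented, and LinkFinder and find_download_url are hypothetical names. urljoin resolves a relative href against the page it came from, which avoids gluing URL strings together by hand.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkFinder(HTMLParser):
    """Collect the href of every <a> whose text equals a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.hrefs = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Surrounding whitespace is ignored when comparing the link text.
            if "".join(self._text).strip() == self.label:
                self.hrefs.append(self._href)
            self._href = None


def find_download_url(page_url, page_html, label="Download (UTF8)"):
    """Return the absolute URL of the first matching link, or None."""
    finder = LinkFinder(label)
    finder.feed(page_html)
    finder.close()
    if not finder.hrefs:
        return None
    return urljoin(page_url, finder.hrefs[0])


if __name__ == "__main__":
    # Invented markup; the real report page may look different.
    sample = ('<p><a href="./ReportPage?download&amp;NAME=ALL&amp;DATE=THISYEAR">'
              'Download (UTF8)</a></p>')
    print(find_download_url("https://myserver/ReportPage?NAME=ALL&DATE=THISYEAR",
                            sample))
```

Returning None when no link matches makes it easy to notice that the login failed or the page layout changed, instead of raising an IndexError on [0].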

For completeness, here is the script template, without any error checking, to show how it works.

#!/usr/bin/python3

import requests
import sys
from lxml import html

s = requests.Session()
s.verify = './myserver.crt'  # certificate of myserver for https

# get the login web site to enter username and password
r = s.get( 'https://myserver' )

# Get the URL for logging in: the action attribute of the login form.
# We use XPath to find it.
tree = html.fromstring(r.text)
loginUrl = 'https://myserver/' + tree.xpath("//form[@id='id4']/@action")[0]

# logging in with username and password and go to ReportPage with queries
r = s.post( loginUrl, data = {'username':'ingo','password':'mypassword'} )
queryUrl = 'https://myserver/ReportPage?NAME=ALL&DATE=THISYEAR'
r = s.get( queryUrl )

# Get the download link for this query from the report page: the link
# whose text is 'Download (UTF8)'.
tree = html.fromstring( r.text )
downloadUrl = 'https://myserver/' + tree.xpath("//a[.='Download (UTF8)']/@href")[0]

# get the download file
r = s.get( downloadUrl )
if r.headers.get('Content-Disposition'):
    print( 'Downloading ...', file=sys.stderr )
    print( r.text )

# log out
r = s.get( 'https://myserver/logout' )
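
If you would rather save the file under the name the server suggests than redirect stdout, something like the following might work. It is a sketch, not part of the script above: filename_from_disposition and save_chunks are hypothetical helpers, and the header is parsed with the standard library's email.message module. Writing raw bytes also avoids the decoding that r.text does and the extra newline that print adds.

```python
from email.message import Message
from pathlib import Path
from typing import Iterable, Optional


def filename_from_disposition(value: str) -> Optional[str]:
    """Extract the filename parameter from a Content-Disposition value."""
    msg = Message()
    msg["Content-Disposition"] = value
    name = msg.get_filename()
    if name is None:
        return None
    # Keep only the last path component so a server-supplied name
    # cannot point outside the current directory.
    return Path(name).name


def save_chunks(chunks: Iterable[bytes], path: Path) -> int:
    """Write byte chunks to path and return the number of bytes written."""
    written = 0
    with open(path, "wb") as out:
        for chunk in chunks:
            written += out.write(chunk)
    return written


if __name__ == "__main__":
    # Header value as shown by Live HTTP headers in the question.
    header = "attachment; filename=\"report1.csv\"; filename*=UTF-8''report1.csv"
    print(filename_from_disposition(header))
```

With requests, the chunks could come from r = s.get( downloadUrl, stream=True ) followed by save_chunks(r.iter_content(chunk_size=8192), Path(name)), which keeps a large report from being held in memory all at once.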