How do I get the real file URL in Python 2.7?

Time: 2018-01-12 13:17:48

Tags: python python-2.7 url redirect python-requests

I have a URL, http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip, that "redirects" me to http://images.vbb.de/assets/ftp/file/286316.zip. I put "redirects" in quotes because Python says there is no redirect:

    In [51]: response = requests.get('http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip')
        ...: if response.history:
        ...:     print "Request was redirected"
        ...:     for resp in response.history:
        ...:         print resp.status_code, resp.url
        ...:     print "Final destination:"
        ...:     print response.status_code, response.url
        ...: else:
        ...:     print "Request was not redirected"
        ...:     
    Request was not redirected

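The status code is also 200, response.history is empty, and response.url gives the first URL rather than the real one. Yet the real URL can be found in Firefox -> Developer Tools -> Network. How can I do the same in Python 2.7? Thanks in advance!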

2 Answers:

Answer 0: (score: 1)

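You first need to follow the redirect manually by parsing the new window.location.href out of the HTML returned by the first request. Requesting that URL then produces a 301 response whose Location header contains the name of the target file: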

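import re
import requests

base_url = 'http://www.vbb.de'
# The first response is an HTML page that redirects via JavaScript, not via HTTP.
response = requests.get(base_url + '/de/datei/GTFS_VBB_Nov2015_Dez2016.zip')
# Pull the redirect target out of the window.location.href assignment.
manual_redirect = base_url + re.findall(r'window\.location\.href\s+=\s+"(.*?)"', response.text)[0]
# This request is answered with a real 301, which requests follows automatically.
response = requests.get(manual_redirect, stream=True)
# The Location header of the 301 holds the final URL; keep only its last path segment.
target_filename = response.history[0].headers['Location'].split('/')[-1]

print("Downloading: '{}'".format(target_filename))
with open(target_filename, 'wb') as f_zip:
    # Stream to disk in 1 KiB chunks instead of holding the whole file in memory.
    for chunk in response.iter_content(chunk_size=1024):
        f_zip.write(chunk)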

This prints:

Downloading: '286316.zip'

and produces a zip file of 29,464,299 bytes.
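
The manual redirect step can also be isolated into a small helper that is easy to test without network access. This is only a minimal sketch, not part of the original answer: the name `find_js_redirect` and the sample HTML are hypothetical, and it assumes the page assigns a double-quoted string to `window.location.href`. It uses `urljoin` instead of string concatenation so that both relative and absolute targets resolve correctly, and it runs on Python 2.7 as well as Python 3.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

# Matches: window.location.href = "target" (whitespace around '=' is optional here).
_JS_REDIRECT_RE = re.compile(r'window\.location\.href\s*=\s*"(.*?)"')


def find_js_redirect(html, page_url):
    """Return the absolute URL assigned to window.location.href, or None."""
    match = _JS_REDIRECT_RE.search(html)
    if match is None:
        return None
    # urljoin resolves a relative target against the URL of the page itself.
    return urljoin(page_url, match.group(1))


# Hypothetical sample markup, used only to illustrate the helper.
sample_html = (
    '<script>window.location.href = '
    '"/de/download/GTFS_VBB_Nov2015_Dez2016.zip";</script>'
)
redirect_url = find_js_redirect(
    sample_html, 'http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip'
)
```

In real use, the `html` argument would be `response.text` and `page_url` would be `response.url` from the first `requests.get` call.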

Answer 1: (score: 0)

You can use BeautifulSoup to read the meta tag in the head of the HTML page and extract the redirect URL from it.

>>> import requests
>>> from bs4 import BeautifulSoup
>>> a = requests.get("http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip")
>>> soup = BeautifulSoup(a.text, 'html.parser')
>>> soup.find_all('meta', attrs={'http-equiv': lambda x: x and x.lower() == 'refresh'})[0]['content'].split('URL=')[1]
'/de/download/GTFS_VBB_Nov2015_Dez2016.zip'

This URL is relative to the domain of the original URL, which gives the new URL http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip. Requesting it downloaded the ZIP file for me:

>>> import shutil
>>> a = requests.get("http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip", stream=True)
>>> with open('test.zip', 'wb') as f:
...     a.raw.decode_content = True
...     shutil.copyfileobj(a.raw, f)
...

$ unzip -l test.zip
Archive:  test.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     5554  2015-11-20 15:17   agency.txt
  2151517  2015-11-20 15:17   calendar_dates.txt
    71731  2015-11-20 15:17   calendar.txt
    65424  2015-11-20 15:17   routes.txt
   816498  2015-11-20 15:17   stops.txt
196020096  2015-11-20 15:17   stop_times.txt
   365499  2015-11-20 15:17   transfers.txt
 11765292  2015-11-20 15:17   trips.txt
      113  2015-11-20 15:17   logging
---------                     -------
211261724                     9 files

For this redirect, a 301 status was returned:

>>> a.history
[<Response [301]>]
>>> a
<Response [200]>
>>> a.history[0]
<Response [301]>
>>> a.history[0].url
'http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip'
>>> a.url
'http://images.vbb.de/assets/ftp/file/286316.zip'
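
For readers who cannot install BeautifulSoup, the meta refresh lookup can be approximated with the standard library alone. The following is a rough sketch under stated assumptions, not the method used in this answer: `MetaRefreshParser` and `find_meta_refresh` are hypothetical names, the sample HTML is made up, and the `content` attribute is assumed to have the usual `delay; URL=target` form. It swaps BeautifulSoup for the standard library's `HTMLParser` and runs on Python 2.7 as well as Python 3.

```python
import re

try:
    from html.parser import HTMLParser  # Python 3
    from urllib.parse import urljoin
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7
    from urlparse import urljoin


class MetaRefreshParser(HTMLParser):
    """Remembers the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.target is not None:
            return
        attrs = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        if (attrs.get('http-equiv') or '').lower() != 'refresh':
            return
        # content looks like "0; URL=/path"; the "URL=" key is case-insensitive.
        match = re.search(r'url\s*=\s*(.+)', attrs.get('content') or '', re.I)
        if match:
            self.target = match.group(1).strip().strip('\'"')


def find_meta_refresh(html, page_url):
    """Return the absolute meta refresh target of the page, or None."""
    parser = MetaRefreshParser()
    parser.feed(html)
    if parser.target is None:
        return None
    # Resolve a relative target against the URL the page was fetched from.
    return urljoin(page_url, parser.target)


# Hypothetical sample markup, used only to illustrate the helper.
sample_html = (
    '<html><head><meta charset="utf-8">'
    '<meta http-equiv="Refresh" '
    'content="0; URL=/de/download/GTFS_VBB_Nov2015_Dez2016.zip">'
    '</head></html>'
)
refresh_url = find_meta_refresh(
    sample_html, 'http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip'
)
```

The resulting URL can then be passed to `requests.get`, which follows the remaining HTTP 301 on its own and records it in `history`.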