Python 3 urllib.request often fails to open a page

Time: 2016-05-18 11:56:20

Tags: python beautifulsoup urllib

I am trying to collect ticket prices from the S7 Airlines website (later I want to try predicting price changes with machine learning).

This small parser fetches the price page, uses BeautifulSoup to find the lowest price for the selected date, and appends it to a .csv file.
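The "lowest price" step can be sketched on its own, without the network or BeautifulSoup. This is only an illustration: `lowest_price` and the sample labels are hypothetical names and values, not taken from the S7 page. It shows why the digit strings should be converted to integers before calling `min()`, since strings are compared character by character.

```python
import re


def lowest_price(labels):
    """Return the smallest price found in the label strings, or None if no label has digits."""
    prices = []
    for label in labels:
        digits = re.sub(r'[^0-9]', '', label)  # keep only the digits
        if digits:                             # skip labels without any digits
            prices.append(int(digits))
    return min(prices) if prices else None


# Hypothetical price labels, as they might appear in the page text.
labels = ['10 200 RUB', '9 500 RUB', '']

print(lowest_price(labels))      # 9500
print(min(['10200', '9500']))    # '10200': string comparison picks the wrong value
```

Returning `None` instead of 0 when nothing is found is a design choice for this sketch; it keeps "no price on the page" distinguishable from a real value in the .csv file.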

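Between requests it waits 30 seconds to let the server "cool down". If it gets a 403 error, it waits 120 minutes and then continues. After every 25 requests it also stops for a longer cool-down (35 minutes in the code below).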
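The waiting rules described above can be written as one small function, which makes them easier to check than sleeps scattered through the loop. This is a sketch using the same numbers as the script; the function name `pause_seconds` and its parameters are my own, not part of the original code.

```python
def pause_seconds(status, request_count):
    """Return how long to sleep after a request.

    status        -- None on success, otherwise the HTTP error code (assumed convention)
    request_count -- number of requests made so far, starting at 1
    """
    if status == 403:
        pause = 120 * 60      # likely banned: wait 120 minutes
    else:
        pause = 30            # normal cool-down between requests
    if request_count % 25 == 0:
        pause += 35 * 60      # extra long cool-down after every 25th request
    return pause


print(pause_seconds(None, 1))    # 30
print(pause_seconds(403, 1))     # 7200
print(pause_seconds(404, 25))    # 2130
```

The main loop could then call `time.sleep(pause_seconds(status, counter))` in one place.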

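The problem is that I very often get a 404 error. I checked some of the links in a browser and they work fine: the page opens and everything looks normal. When I try to open just a single link, the parser sometimes succeeds, but usually it also returns a 404 error.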

I tried adding a timeout, but it did not help.
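One way to narrow this down is to look at what the server actually sends back with the 404 instead of only printing the code. An `HTTPError` from `urllib` is also a response object, so its headers and body can be read. The sketch below is an assumption-laden illustration: `fetch` and `describe_http_error` are hypothetical helper names, and the demo uses a hand-built error rather than a real response from the site.

```python
import io
import urllib.request
from urllib.error import HTTPError


def fetch(url, timeout=30):
    """Download a page with a browser-like User-Agent and an explicit timeout."""
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def describe_http_error(err, body_limit=200):
    """Collect the details of an HTTPError that help explain why a request failed."""
    body = err.read(body_limit)  # the error page itself often says what went wrong
    return {
        'code': err.code,
        'reason': err.reason,
        'url': err.url,
        'headers': dict(err.headers.items()),
        'body_start': body.decode('utf-8', errors='replace'),
    }


# Demo with a fabricated error (no network access); a real one would come from fetch().
fake = HTTPError('http://example.invalid/page', 404, 'Not Found',
                 {'Server': 'demo'}, io.BytesIO(b'<html>demo error page</html>'))
print(describe_http_error(fake))
```

In the real script this would be used as `except HTTPError as err: print(describe_http_error(err))`. Comparing those headers and the body with what the browser receives for the same URL may show whether the server treats the two requests differently.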

I am new to programming and may be making an obvious mistake.

Thank you in advance. Sorry for my English :)

from bs4 import BeautifulSoup as BS
import urllib.request as req
import re
import time
from urllib.error import HTTPError


def loader(dest, dest_large, dep_date):
    # Build the S7 search URL for the given destination and 2016 departure date.
    url = "http://travelwith.s7.ru/selectExactDateSearchFlights.action?TA=1&TC=0&TI=0&" \
          "CUR=RUB&FLC=1&FLX=false&RDMPTN=false&SC1=ANY&FSC1=1&DD1=2016-%s&" \
          "DA1=SVX&DP1=AIR_SVX_RU&AA1=%s&AP1=AIR_%s&LAN=ru" % (dep_date, dest, dest_large)
    request = req.Request(url,
                          headers={
                              'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0'})
    with req.urlopen(request) as response:
        avia = response.read()
    soup = BS(avia, 'html.parser')
    # Keep only the digits of each price label and compare them as integers;
    # min() on the raw strings would compare them lexicographically.
    digits = [re.sub('[^0-9]', '', p.get_text()) for p in soup.find_all("span", class_="radiobutton-text")]
    prices = [int(d) for d in digits if d]
    price = min(prices) if prices else 0  # 0 means no price was found on the page
    # Append "timestamp;price" to the per-route CSV file.
    with open('prices/SVX_%s_%s.csv' % (dest, dep_date), 'a', encoding="utf8") as file:
        file.write(str(time.time()) + ';' + str(price) + '\n')
    # Use the function's own arguments rather than the globals of the main loop.
    print(dep_date, dest, "OK!", price)


dep_date = ['06-08', '06-15', '06-22', '06-29', '07-01', '07-08', '07-15',
            '07-22', '07-29', '08-01', '08-08', '08-15', '08-22', '08-29',
            '09-01', '09-08', '09-15', '09-22', '09-29', '10-01', '10-08', '10-15']
dest = ['FCO', 'PRG', 'BCN', 'TXL']
dest_large = ['FCO_IT', 'PRG_CZ', 'BCN_ES', 'TXL_DE']

counter = 0
while True:
    for date in dep_date:
        for i, destination in enumerate(dest):
            counter += 1
            try:
                loader(destination, dest_large[i], date)
                time.sleep(30)
            except HTTPError as err:
                if err.code == 403:  # wait if the IP has been banned
                    print(date, destination, "Error 403. Ban, wait 120 minutes!")
                    time.sleep(120 * 60)
                elif err.code == 404:
                    print(date, destination, "404. Can't open!")
                    time.sleep(30)
                else:  # report any other HTTP error instead of silently ignoring it
                    print(date, destination, "HTTP error", err.code)
                    time.sleep(30)
            if counter == 25:  # longer cool-down after every 25 requests
                time.sleep(60 * 35)
                counter = 0

0 answers:

No answers yet.