So, I'm trying to get ticket prices from the S7 airlines website (later I want to try to predict price changes with the help of machine learning).
This small parser fetches the fare page, finds the lowest price for the selected date with the help of BeautifulSoup, and saves it to a .csv file.
在"冷却"的请求之间30秒服务器。如果获得403错误,则等待120分钟并继续。每25个请求停止"冷却"。问题是我经常遇到404错误。我在浏览器中查看了一些链接并且它们正常工作 - 页面已打开,一切正常。当我尝试只打开一个链接时,解析器会不时地执行此操作,但通常也会返回404错误。
I tried adding a timeout, but it didn't help; a simplified version of what I tried is shown right below.
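This is roughly how I added the timeout (simplified sketch: fetch_with_timeout is just a name I use here, the real request building is in the full script below, and the 10-second value is only what I experimented with):

import urllib.request as req

def fetch_with_timeout(url, timeout=10):
    # Same request as in the full script, only with a socket timeout added.
    request = req.Request(
        url,
        headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0'})
    return req.urlopen(request, timeout=timeout).read()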
I'm new to programming and am probably making some obvious mistake.
Thanks in advance. Sorry for my English :)
from bs4 import BeautifulSoup as BS
import urllib.request as req
import re
import time
from urllib.error import HTTPError
def loader(dest, dest_large, dep_date):
    # Build the search URL for a one-way flight from SVX on the given 2016 date
    url = "http://travelwith.s7.ru/selectExactDateSearchFlights.action?TA=1&TC=0&TI=0&" \
          "CUR=RUB&FLC=1&FLX=false&RDMPTN=false&SC1=ANY&FSC1=1&DD1=2016-%s&" \
          "DA1=SVX&DP1=AIR_SVX_RU&AA1=%s&AP1=AIR_%s&LAN=ru" % (dep_date, dest, dest_large)
    request = req.Request(
        url,
        headers={
            'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0'})
    avia = req.urlopen(request).read()
    soup = BS(avia, 'html.parser')

    # Strip everything except digits from every fare shown on the page and
    # keep the lowest one (compared as numbers, not as strings)
    texts = [re.sub('[^0-9]', '', p.get_text())
             for p in soup.find_all("span", class_="radiobutton-text")]
    prices = [int(t) for t in texts if t]
    price = min(prices) if prices else 0

    # Append a timestamped record to the per-route, per-date .csv file
    with open('prices/SVX_%s_%s.csv' % (dest, dep_date), 'a', encoding="utf8") as file:
        file.write(str(time.time()) + ';' + str(price) + '\n')
    print(dep_date, dest, "OK!", price)
dep_date = ['06-08', '06-15', '06-22', '06-29', '07-01', '07-08', '07-15',
'07-22', '07-29', '08-01', '08-08', '08-15', '08-22', '08-29',
'09-01', '09-08', '09-15', '09-22', '09-29', '10-01', '10-08', '10-15']
dest = ['FCO', 'PRG', 'BCN', 'TXL']
dest_large = ['FCO_IT', 'PRG_CZ', 'BCN_ES', 'TXL_DE']
counter = 0
while True:
    for date in dep_date:
        for i, destination in enumerate(dest):
            counter += 1
            try:
                loader(destination, dest_large[i], date)
                time.sleep(30)  # 30-second "cool-down" between requests
            except HTTPError as err:
                if err.code == 403:  # wait if the IP has been banned
                    print(date, destination, "Error 403. Ban, wait 120 minutes!")
                    time.sleep(120 * 60)
                if err.code == 404:
                    print(date, destination, "404. Can't open!")
                    time.sleep(30)
            if counter == 25:  # longer pause after every 25 requests
                time.sleep(60 * 35)
                counter = 0
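For what it's worth, this is the kind of debugging helper I'm thinking of adding to see what the server actually sends back on those 404s (debug_fetch is just a name I made up for this sketch; the HTTPError object is file-like, so its body can still be read):

import urllib.request as req
from urllib.error import HTTPError

def debug_fetch(url):
    # On a 404, read the error body anyway to see whether the server returns
    # a real page, an empty stub, or some kind of anti-bot message.
    try:
        return req.urlopen(req.Request(url)).read()
    except HTTPError as err:
        body = err.read().decode('utf-8', errors='replace')
        print("HTTP", err.code, "- first 200 characters of the body:", body[:200])
        raise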