Web-scraping browser error with multiple URLs

Date: 2021-03-27 12:51:36

Tags: python selenium-webdriver web-scraping

This code works for a single URL, but when I use multiple URLs I get an error:

import os
import pandas as pd
from selenium import webdriver
from tabulate import tabulate
from datetime import datetime
import time
from bs4 import BeautifulSoup as bs

start = datetime.now()
browser = webdriver.Chrome()

class GameData:

    def __init__(self):
        self.date = []
        self.time = []
        self.game = []
        self.score = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []


def parse_data(url):
    browser.get(url)
    df = pd.read_html(browser.page_source, header=0)[0]
    html = browser.page_source
    soup = bs(html,"lxml")
    cont = soup.find('div', {'id':'wrap'})
    conti = cont.find('div', {'id':'col-content'})
    content = conti.find('table', {'class':'table-main'}, {'id':'tournamentTable'})
    main = content.find('th', {'class':'first2 tl'})
    count = main.findAll('a')
    country = count[1].text
    league = count[2].text
    game_data = GameData()
    game_date = None
    for row in df.itertuples():
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            game_date = row[1].split('-')[0]
            continue
        game_data.date.append(game_date)
        game_data.time.append(row[1])
        game_data.game.append(row[2])
        game_data.score.append(row[3])
        game_data.home_odds.append(row[4])
        game_data.draw_odds.append(row[5])
        game_data.away_odds.append(row[6])
    browser.quit()
    return game_data, country, league

# You can input as many URLs as you want
urls = {
"https://www.oddsportal.com/soccer/europe/champions-league/results/",
"https://www.oddsportal.com/soccer/australia/a-league/results/#/page/1/",
"https://www.oddsportal.com/soccer/belgium/jupiler-league/results/#/page/1/",
"https://www.oddsportal.com/soccer/czech-republic/1-liga/results/#/page/1/",
}

if __name__ == '__main__':

    results = None

    for url in urls:
        game_data, country, competition = parse_data(url)
        result = pd.DataFrame(game_data.__dict__)
        result['country'] = country
        result['competition'] = competition
        if results is None:
            results = result
        else:
            results = results.append(result, ignore_index=True)

print(tabulate(results.head(), headers='keys', tablefmt="github"))
end = datetime.now()
time_taken = end - start
print('Time taken to complete: ', time_taken)

The browser does not move on to the next URL and closes:

    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=54008): Max retries exceeded with url: /session/6c9f4ce81beb95e93f2cc32858dfb114/url (Caused by NewConnectionError('

How can I run it for multiple URLs?
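A likely cause of the MaxRetryError is that `parse_data` calls `browser.quit()` before returning, so the shared browser session is already dead when the loop reaches the second URL. The sketch below illustrates the pattern with a hypothetical `FakeDriver` stand-in (not a real WebDriver, to keep the example self-contained); the fix is simply to quit once, after the loop:

```python
class FakeDriver:
    """Stand-in for webdriver.Chrome(): get() fails once the session is quit."""
    def __init__(self):
        self.alive = True

    def get(self, url):
        if not self.alive:
            # Mirrors urllib3's error when Selenium's local session is gone
            raise ConnectionError(f"Max retries exceeded with url: {url}")
        return f"page source of {url}"

    def quit(self):
        self.alive = False


def parse_data_buggy(driver, url):
    page = driver.get(url)
    driver.quit()          # closes the shared session -> the next URL fails
    return page


def parse_data_fixed(driver, url):
    return driver.get(url) # leave the session open; quit once after the loop


urls = ["https://example.com/a", "https://example.com/b"]

# Buggy version: the second get() hits a dead session.
driver = FakeDriver()
errors = []
for url in urls:
    try:
        parse_data_buggy(driver, url)
    except ConnectionError as e:
        errors.append(str(e))
print(len(errors))  # 1 — only the second URL fails

# Fixed version: parse every URL, then quit exactly once.
driver = FakeDriver()
pages = [parse_data_fixed(driver, url) for url in urls]
driver.quit()
print(len(pages))  # 2
```

The same change applies to the real script: delete `browser.quit()` from `parse_data` and call it once after the `for url in urls:` loop (or create a fresh `webdriver.Chrome()` per URL, at the cost of one browser launch per page).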

Also, how can I get [country] and [competition] into game_data?
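One way (a sketch under the assumption that one value per row is acceptable) is to give `GameData` its own `country` and `league` lists and append the scraped values once per row; then `pd.DataFrame(game_data.__dict__)` picks them up automatically, with no need for the extra `result['country'] = ...` assignments. The row data below is made up for illustration:

```python
import pandas as pd

class GameData:
    def __init__(self):
        self.date = []
        self.time = []
        self.country = []   # new: filled once per row
        self.league = []    # new: filled once per row

game_data = GameData()
# In the real script these would come from the scraped header <a> tags
country, league = "Europe", "Champions League"
rows = [("27 Mar 2021", "18:30"), ("27 Mar 2021", "20:45")]  # dummy rows

for d, t in rows:
    game_data.date.append(d)
    game_data.time.append(t)
    game_data.country.append(country)
    game_data.league.append(league)

df = pd.DataFrame(game_data.__dict__)
print(list(df.columns))  # ['date', 'time', 'country', 'league']
```

Because every attribute list has the same length, the `__dict__` trick the original code already uses keeps working, and the country/league columns travel with the rest of the game data.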

0 Answers:

No answers