I want to scrape data from pages like this one:
http://elofootball.com/country.php?countryiso=ENG&season=2007-2008
Specifically, I want the W/D/L statistics for all clubs in the third table, which only appears after you press the "Statistics" button. I plan to scrape roughly 55 countries across 10 seasons.
I have some BeautifulSoup code that I wrote for a similar site. It runs against this site as well, but the vast majority of cells come back blank; only a seemingly random few contain data.
My guess is that this happens because the page uses a JavaScript query? I have heard of Selenium, but I need some help setting it up, and/or I would like to know whether there is a faster approach, because I expect Selenium to take a long time to work through roughly 500 pages.
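From skimming the Selenium docs, I think the setup would look roughly like the sketch below (this assumes Chrome plus chromedriver; the "Statistics" locator and the wait condition are guesses on my part and would need to be adapted to the real page):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")   # no visible browser window
driver = webdriver.Chrome(options=options)
driver.get("http://elofootball.com/country.php?countryiso=ENG&season=2007-2008")

# Hypothetical locator: click whatever element actually reveals the statistics table
driver.find_element(By.LINK_TEXT, "Statistics").click()

# Wait until the Stats div contains rows before reading the rendered HTML
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#Stats tr"))
)
soup = BeautifulSoup(driver.page_source, "lxml")
table = soup.find("div", {"id": "Stats"})
driver.quit()

Headless mode at least avoids rendering a window, but it is still one full page load per URL, which is why I am worried about the ~500 pages.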
Here is my code:
import pandas as pd
import numpy as np
import requests
import time
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')
Then I create a dictionary for each league (these go into a list later). For example:
#create a dictionary for each league and iterate over seasons
dct_AUT = {}
dct_BEL = {}
dct_BGR = {}
for m in range(2007, 2019):
    dct_AUT['df_AUT_%s' % m] = pd.DataFrame()
    dct_BEL['df_BEL_%s' % m] = pd.DataFrame()
    dct_BGR['df_BGR_%s' % m] = pd.DataFrame()
Then I loop over the URLs:
#list of URL bases for each league
league_urls = ['http://elofootball.com/country.php?countryiso=AUT&season=',
               'http://elofootball.com/country.php?countryiso=BEL&season=',
               'http://elofootball.com/country.php?countryiso=BGR&season=']
This is the scraping part:
#Scraping part
#The first loop is over each URL in our URL list
for l in range(0, len(league_urls)):
    time.sleep(0.5)
    #The second loop is over each season we want to scrape
    for n in range(2007, 2019):
        time.sleep(0.5)
        df_soccer1 = None
        url = league_urls[l] + str(n) + '-' + str(n + 1)
        headers = {"User-Agent": "Mozilla/5.0"}
        response = requests.get(url, headers=headers, verify=False)
        time.sleep(0.5)
        soup = BeautifulSoup(response.text, 'lxml')
        #The stats table (the one behind the "Statistics" button)
        table = soup.find("div", {"id": "Stats"})
        team = []
        totalp = []
        totalw = []
        totald = []
        totall = []
        totalgf = []
        totalga = []
        tier = []
        leaguep = []
        leaguew = []
        leagued = []
        leaguel = []
        leaguegf = []
        leaguega = []
        leaguepts = []
        cupp = []
        cupw = []
        cupd = []
        cupl = []
        cupgf = []
        cupga = []
        europ = []
        eurow = []
        eurod = []
        eurol = []
        eurogf = []
        euroga = []
        for row in table.find_all('tr'):
            try:
                col = row.find_all('td')
                team_ = col[1].text
                totalp_ = col[3].text
                totalw_ = col[4].text
                totald_ = col[5].text
                totall_ = col[6].text
                totalgf_ = col[7].text
                totalga_ = col[8].text
                tier_ = col[9].text
                leaguep_ = col[10].text
                leaguew_ = col[11].text
                leagued_ = col[12].text
                leaguel_ = col[13].text
                leaguegf_ = col[14].text
                leaguega_ = col[15].text
                leaguepts_ = col[16].text
                cupp_ = col[17].text
                cupw_ = col[18].text
                cupd_ = col[19].text
                cupl_ = col[20].text
                cupgf_ = col[21].text
                cupga_ = col[22].text
                europ_ = col[23].text
                eurow_ = col[24].text
                eurod_ = col[25].text
                eurol_ = col[26].text
                eurogf_ = col[27].text
                euroga_ = col[28].text
                team.append(team_)
                totalp.append(totalp_)
                totalw.append(totalw_)
                totald.append(totald_)
                totall.append(totall_)
                totalgf.append(totalgf_)
                totalga.append(totalga_)
                tier.append(tier_)
                leaguep.append(leaguep_)
                leaguew.append(leaguew_)
                leagued.append(leagued_)
                leaguel.append(leaguel_)
                leaguegf.append(leaguegf_)
                leaguega.append(leaguega_)
                leaguepts.append(leaguepts_)
                cupp.append(cupp_)
                cupw.append(cupw_)
                cupd.append(cupd_)
                cupl.append(cupl_)
                cupgf.append(cupgf_)
                cupga.append(cupga_)
                europ.append(europ_)
                eurow.append(eurow_)
                eurod.append(eurod_)
                eurol.append(eurol_)
                eurogf.append(eurogf_)
                euroga.append(euroga_)
            except:
                #rows without enough <td> cells (e.g. header rows) are skipped
                pass
        df_soccer1 = pd.DataFrame({'Team': team[1:], 'Season': [n]*(len(team)-1), 'totalp': totalp[1:], 'totalw': totalw[1:],
                                   'totald': totald[1:], 'totall': totall[1:], 'totalgf': totalgf[1:], 'totalga': totalga[1:],
                                   'tier': tier[1:], 'leaguep': leaguep[1:], 'leaguew': leaguew[1:], 'leagued': leagued[1:],
                                   'leaguel': leaguel[1:], 'leaguegf': leaguegf[1:], 'leaguega': leaguega[1:],
                                   'leaguepts': leaguepts[1:], 'cupp': cupp[1:], 'cupw': cupw[1:], 'cupd': cupd[1:],
                                   'cupl': cupl[1:], 'cupgf': cupgf[1:], 'cupga': cupga[1:], 'europ': europ[1:],
                                   'eurow': eurow[1:], 'eurod': eurod[1:], 'eurol': eurol[1:], 'eurogf': eurogf[1:],
                                   'euroga': euroga[1:]})
        #Store all dictionaries in a list
        dct_all = [dct_AUT, dct_BEL, dct_BGR]
        #Store this season's DataFrame in the matching country dictionary
        dct_all[l]['df_bl_%s' % n] = df_soccer1
After that I combine the data into a single dataset.
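The combining step is essentially just a pd.concat over the dictionaries created above, roughly:

#collect every season DataFrame from every country dictionary
all_frames = []
for dct in [dct_AUT, dct_BEL, dct_BGR]:
    all_frames.extend(dct.values())
df_all = pd.concat(all_frames, ignore_index=True)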
Any ideas why I am getting mostly blanks? There seems to be no pattern to which cells come back blank and which ones actually contain scraped data.
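In case it helps with diagnosing this, the kind of quick check I have in mind (reusing url and headers from the loop above) is to look at whether the numbers show up in the raw HTML at all, or only after the JavaScript runs:

#quick check: what does requests actually see inside the Stats div?
response = requests.get(url, headers=headers, verify=False)
soup = BeautifulSoup(response.text, 'lxml')
stats = soup.find("div", {"id": "Stats"})
rows = stats.find_all('tr') if stats else []
print(len(rows), "rows found in the Stats div")
for row in rows[:3]:
    print([td.text for td in row.find_all('td')])   # first few rows, cell by cell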