I want to scrape data from pages like this one:
http://elofootball.com/country.php?countryiso=ENG&season=2007-2008
Specifically, I want the W/D/L statistics for all clubs in the third table, which only appears after you press the "Statistics" button. I plan to scrape roughly 55 countries across 10 seasons.
I have some BeautifulSoup code that I wrote for a similar site. It runs against this site as well, but the vast majority of cells come back blank; only a seemingly random few contain data.
My guess is that this happens because the page uses a JavaScript query? I have heard of Selenium, but I need some help setting it up, and/or I would like to know whether there is a faster approach, because I expect Selenium to take a long time to work through roughly 500 pages.
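From skimming the Selenium docs, I think the setup would look roughly like the sketch below (this assumes Chrome plus chromedriver; the "Statistics" locator and the wait condition are guesses on my part and would need to be adapted to the real page):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")   # no visible browser window
driver = webdriver.Chrome(options=options)
driver.get("http://elofootball.com/country.php?countryiso=ENG&season=2007-2008")

# Hypothetical locator: click whatever element actually reveals the statistics table
driver.find_element(By.LINK_TEXT, "Statistics").click()

# Wait until the Stats div contains rows before reading the rendered HTML
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#Stats tr"))
)
soup = BeautifulSoup(driver.page_source, "lxml")
table = soup.find("div", {"id": "Stats"})
driver.quit()

Headless mode at least avoids rendering a window, but it is still one full page load per URL, which is why I am worried about the ~500 pages.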
Here is my code:
import pandas as pd
import numpy as np
import requests
import time
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')
Then I create a dictionary for each league (these go into a list later). For example:
#create a dictionary for each league and iterate over seasons
dct_AUT = {}
dct_BEL = {}
dct_BGR = {}
for m in range(2007, 2019):
    dct_AUT['df_AUT_%s' % m] = pd.DataFrame()
    dct_BEL['df_BEL_%s' % m] = pd.DataFrame()
    dct_BGR['df_BGR_%s' % m] = pd.DataFrame()
Then I loop over the URLs:
#list of URL bases for each league
league_urls = ['http://elofootball.com/country.php?countryiso=AUT&season=',
               'http://elofootball.com/country.php?countryiso=BEL&season=',
               'http://elofootball.com/country.php?countryiso=BGR&season=']
This is the scraping part:
#Scraping part
#The first loop is over each URL in our URL list
for l in range(0, len(league_urls)):
    time.sleep(0.5)
    #The second loop is over each season we want to scrape
    for n in range(2007, 2019):
        time.sleep(0.5)
        df_soccer1 = None
        url = league_urls[l] + str(n) + '-' + str(n + 1)
        headers = {"User-Agent": "Mozilla/5.0"}
        response = requests.get(url, headers=headers, verify=False)
        time.sleep(0.5)
        soup = BeautifulSoup(response.text, 'lxml')
        #The stats table (the one behind the "Statistics" button)
        table = soup.find("div", {"id": "Stats"})
        team = []
        totalp = []
        totalw = []
        totald = []
        totall = []
        totalgf = []
        totalga = []
        tier = []
        leaguep = []
        leaguew = []
        leagued = []
        leaguel = []
        leaguegf = []
        leaguega = []
        leaguepts = []
        cupp = []
        cupw = []
        cupd = []
        cupl = []
        cupgf = []
        cupga = []
        europ = []
        eurow = []
        eurod = []
        eurol = []
        eurogf = []
        euroga = []
        for row in table.find_all('tr'):
            try:
                col = row.find_all('td')
                team_ = col[1].text
                totalp_ = col[3].text
                totalw_ = col[4].text
                totald_ = col[5].text
                totall_ = col[6].text
                totalgf_ = col[7].text
                totalga_ = col[8].text
                tier_ = col[9].text
                leaguep_ = col[10].text
                leaguew_ = col[11].text
                leagued_ = col[12].text
                leaguel_ = col[13].text
                leaguegf_ = col[14].text
                leaguega_ = col[15].text
                leaguepts_ = col[16].text
                cupp_ = col[17].text
                cupw_ = col[18].text
                cupd_ = col[19].text
                cupl_ = col[20].text
                cupgf_ = col[21].text
                cupga_ = col[22].text
                europ_ = col[23].text
                eurow_ = col[24].text
                eurod_ = col[25].text
                eurol_ = col[26].text
                eurogf_ = col[27].text
                euroga_ = col[28].text
                team.append(team_)
                totalp.append(totalp_)
                totalw.append(totalw_)
                totald.append(totald_)
                totall.append(totall_)
                totalgf.append(totalgf_)
                totalga.append(totalga_)
                tier.append(tier_)
                leaguep.append(leaguep_)
                leaguew.append(leaguew_)
                leagued.append(leagued_)
                leaguel.append(leaguel_)
                leaguegf.append(leaguegf_)
                leaguega.append(leaguega_)
                leaguepts.append(leaguepts_)
                cupp.append(cupp_)
                cupw.append(cupw_)
                cupd.append(cupd_)
                cupl.append(cupl_)
                cupgf.append(cupgf_)
                cupga.append(cupga_)
                europ.append(europ_)
                eurow.append(eurow_)
                eurod.append(eurod_)
                eurol.append(eurol_)
                eurogf.append(eurogf_)
                euroga.append(euroga_)
            except:
                #rows without enough <td> cells (e.g. header rows) are skipped
                pass
        df_soccer1 = pd.DataFrame({'Team': team[1:], 'Season': [n]*(len(team)-1), 'totalp': totalp[1:], 'totalw': totalw[1:],
                                   'totald': totald[1:], 'totall': totall[1:], 'totalgf': totalgf[1:], 'totalga': totalga[1:],
                                   'tier': tier[1:], 'leaguep': leaguep[1:], 'leaguew': leaguew[1:], 'leagued': leagued[1:],
                                   'leaguel': leaguel[1:], 'leaguegf': leaguegf[1:], 'leaguega': leaguega[1:],
                                   'leaguepts': leaguepts[1:], 'cupp': cupp[1:], 'cupw': cupw[1:], 'cupd': cupd[1:],
                                   'cupl': cupl[1:], 'cupgf': cupgf[1:], 'cupga': cupga[1:], 'europ': europ[1:],
                                   'eurow': eurow[1:], 'eurod': eurod[1:], 'eurol': eurol[1:], 'eurogf': eurogf[1:],
                                   'euroga': euroga[1:]})
        #Store all dictionaries in a list
        dct_all = [dct_AUT, dct_BEL, dct_BGR]
        #Store this season's DataFrame in the matching country dictionary
        dct_all[l]['df_bl_%s' % n] = df_soccer1
After that I combine the data into a single dataset.
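The combining step is essentially just a pd.concat over the dictionaries created above, roughly:

#collect every season DataFrame from every country dictionary
all_frames = []
for dct in [dct_AUT, dct_BEL, dct_BGR]:
    all_frames.extend(dct.values())
df_all = pd.concat(all_frames, ignore_index=True)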
Any ideas why I am getting mostly blanks? There seems to be no pattern to which cells come back blank and which ones actually contain scraped data.
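In case it helps with diagnosing this, the kind of quick check I have in mind (reusing url and headers from the loop above) is to look at whether the numbers show up in the raw HTML at all, or only after the JavaScript runs:

#quick check: what does requests actually see inside the Stats div?
response = requests.get(url, headers=headers, verify=False)
soup = BeautifulSoup(response.text, 'lxml')
stats = soup.find("div", {"id": "Stats"})
rows = stats.find_all('tr') if stats else []
print(len(rows), "rows found in the Stats div")
for row in rows[:3]:
    print([td.text for td in row.find_all('td')])   # first few rows, cell by cell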