Fetching stats using Beautiful Soup

Time: 2014-08-07 05:22:02

Tags: python web-scraping html-parsing beautifulsoup

I am working with this page: http://www.afl.com.au/match-centre/2014/19/fre-v-carl

I want to fetch the stats shown on its right-hand side (screenshot omitted).

I used the following function to fetch the data. I create the soup and pass it to the function as an argument:

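def fetchStats(soup):
    # walk down: season-stats module -> module content -> unstyled <ul> lists
    for i in soup.findAll("div", {"class": "module", "id": "season-stats"}):
        for index in i.findAll("div", {"class": "module-content"}):
            for item in index.findAll("ul", style=False):
                # collected but never used below
                li = item.findAll("li", {"class": "major"})
                print(item.getText())
            break  # only look at the first module-content block
        break  # only look at the first season-stats module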

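But it does not do what I want. I need to store all of a team's stats in a dictionary, where each key is a team name and each value is a list of two-member tuples (stat name, stat value), for example:

dic = {"Team 1 Name": [("Disposals", 311), ("Kicks", 190), ...], "Team 2 Name": [("Disposals", 315), ("Kicks", 224), ...]}

Please help me.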

1 Answer:

Answer 0 (score: 0)

The idea is to collect the home team stats, the away team stats, and the stat names into three separate lists, and then zip() them:

from pprint import pprint

try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3

from bs4 import BeautifulSoup


url = "http://www.afl.com.au/match-centre/2014/19/fre-v-carl"
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(urlopen(url), "html.parser")

# get team names
home_team_name = soup.find('div', class_='home-team').p.a.text.strip()
away_team_name = soup.find('div', class_='away-team').p.a.text.strip()

# get stats: one list per team, plus the list of stat names
season_stats = soup.find('div', id='season-stats')
home_stats = [li.text for li in season_stats.select('ul#home-team-stats > li')]
away_stats = [li.text for li in season_stats.select('ul#away-team-stats > li')]
params = [li.text for li in season_stats.select('ul.headers > li')]

# pair each stat name with its value; list() keeps this working on
# Python 3, where zip() returns an iterator
stats = {home_team_name: list(zip(params, home_stats)),
         away_team_name: list(zip(params, away_stats))}

pprint(stats)

Prints:

{u'Carlton': [(u'Disposals', u'315'),
              (u'Kicks', u'224'),
              (u'Handballs', u'91'),
              (u'Free Kicks', u'22'),
              (u'Clearances', u'39'),
              (u'Centre', u'13'),
              (u'Stoppages', u'26'),
              (u'Inside 50', u'40'),
              (u'Marks in 50', u'8'),
              (u'Contested Possessions', u'127'),
              (u'Tackles', u'59'),
              (u'Hit-Outs', u'23'),
              (u'Interchanges', u'111')],
 u'Fremantle': [(u'Disposals', u'311'),
                (u'Kicks', u'190'),
                (u'Handballs', u'121'),
                (u'Free Kicks', u'14'),
                (u'Clearances', u'38'),
                (u'Centre', u'10'),
                (u'Stoppages', u'28'),
                (u'Inside 50', u'47'),
                (u'Marks in 50', u'13'),
                (u'Contested Possessions', u'124'),
                (u'Tackles', u'63'),
                (u'Hit-Outs', u'75'),
                (u'Interchanges', u'96')]}
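
Note that the values in this output are strings, while the question asked for integers. A minimal sketch of the conversion follows, assuming every value is a plain whole number as in the output shown. `to_int_stats` is a hypothetical helper name, and `sample` is a shortened copy of the printed result rather than live data:

```python
def to_int_stats(stats):
    """Convert the value in each (name, value) pair from a string to an int."""
    return {
        team: [(name, int(value)) for name, value in pairs]
        for team, pairs in stats.items()
    }


# Shortened copy of the printed result above (assumption: values are digits only).
sample = {
    "Fremantle": [("Disposals", "311"), ("Kicks", "190")],
    "Carlton": [("Disposals", "315"), ("Kicks", "224")],
}

converted = to_int_stats(sample)
print(converted["Fremantle"])
```

If the page ever reports a non-numeric value, `int()` will raise a `ValueError`, so a real scraper would likely want to handle that case explicitly.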

FYI, you can read more about these select() calls in the "CSS selectors" section of the Beautiful Soup documentation.
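
To see how those selectors behave without hitting the live site, here is a small sketch that runs the same three `select()` calls against an inline HTML fragment. The markup is made up: only the ids and class names are taken from the selectors in the answer, and the real page's structure may differ or have changed since 2014. The team names are hard-coded for the same reason, and `texts` is a hypothetical helper. The code targets Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the ids/classes used by the answer's selectors.
SAMPLE_HTML = """
<div id="season-stats">
  <ul class="headers"><li>Disposals</li><li>Kicks</li></ul>
  <ul id="home-team-stats"><li>311</li><li>190</li></ul>
  <ul id="away-team-stats"><li>315</li><li>224</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
season_stats = soup.find("div", id="season-stats")


def texts(selector):
    """Return the stripped text of every element matching a CSS selector."""
    return [li.get_text(strip=True) for li in season_stats.select(selector)]


# "ul.headers > li" matches <li> elements that are direct children of <ul class="headers">;
# "ul#home-team-stats > li" does the same for the <ul> with that id.
params = texts("ul.headers > li")
home_stats = texts("ul#home-team-stats > li")
away_stats = texts("ul#away-team-stats > li")

stats = {
    "Fremantle": list(zip(params, home_stats)),
    "Carlton": list(zip(params, away_stats)),
}
print(stats)
```

Because `zip()` pairs items by position, this approach relies on the three lists coming back in the same order and with the same length.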