Python BeautifulSoup4 website parsing

Time: 2014-02-01 18:37:09

Tags: python web-scraping beautifulsoup

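I'm trying to scrape some sports data from a website using BeautifulSoup4, but I'm having trouble working out how to go about it. I'm not that good with HTML and can't seem to figure out the last bit of syntax I need. Once the data is parsed, I'll load it into a Pandas DataFrame. I'm trying to extract the home team, the away team and the score. This is my code so far: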

from bs4 import BeautifulSoup
import urllib2
import csv

url = 'http://www.bbc.com/sport/football/premier-league/results'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

def has_class_but_no_id(tag):
    return tag.has_attr('score')

writer = csv.writer(open("webScraper.csv", "w"))

for tag in soup.find_all('span', {'class':['team-away', 'team-home', 'score']}):
    print(tag)

Here is a sample of the output:

<span class="team-home teams">
<a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>
<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>
<span class="team-away teams">
<a href="/sport/football/teams/sunderland">Sunderland</a> </span>

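I need to store the home team (Newcastle), the score (0-3) and the away team (Sunderland) in three separate fields. Essentially, I'm stuck trying to extract the "value" from each tag and can't figure out the syntax in bs4. I need something like a tag.value property, but all I've found in the documentation is tag.name and tag.attrs. Any help or pointers would be greatly appreciated!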

3 answers:

Answer 0 (score: 3)

Each score unit is located inside a <td class='match-details'> element; loop over those to extract the match details.

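From there, you can extract the text from child elements using the .stripped_strings generator; just pass it to ''.join() to get all the strings contained in a tag. Pick team-home, score and team-away separately for ease of parsing: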

for match in soup.find_all('td', class_='match-details'):
    home_tag = match.find('span', class_='team-home')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='score')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='team-away')
    away = away_tag and ''.join(away_tag.stripped_strings)

With an additional print, this gives:

>>> for match in soup.find_all('td', class_='match-details'):
...     home_tag = match.find('span', class_='team-home')
...     home = home_tag and ''.join(home_tag.stripped_strings)
...     score_tag = match.find('span', class_='score')
...     score = score_tag and ''.join(score_tag.stripped_strings)
...     away_tag = match.find('span', class_='team-away')
...     away = away_tag and ''.join(away_tag.stripped_strings)
...     if home and score and away:
...         print home, score, away
... 
Newcastle 0-3 Sunderland
West Ham 2-0 Swansea
Cardiff 2-1 Norwich
Everton 2-1 Aston Villa
Fulham 0-3 Southampton
Hull 1-1 Tottenham
Stoke 2-1 Man Utd
Aston Villa 4-3 West Brom
Chelsea 0-0 West Ham
Sunderland 1-0 Stoke
Tottenham 1-5 Man City
Man Utd 2-0 Cardiff
# etc. etc. etc.
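
The loop above can also be packaged so that the three fields end up as rows ready for a CSV file or a DataFrame. What follows is only a minimal sketch, written for Python 3: the helper names text_of and extract_matches are made up for illustration, the markup is the sample from the question plus one invented incomplete cell, and 'html.parser' is chosen so that no extra parser needs to be installed.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup modelled on the output shown in the question; the second
# cell is invented and deliberately incomplete to exercise the None check.
HTML = """
<table>
<tr><td class="match-details">
<span class="team-home teams">
<a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>
<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>
<span class="team-away teams">
<a href="/sport/football/teams/sunderland">Sunderland</a> </span>
</td></tr>
<tr><td class="match-details">
<span class="team-home teams"><a href="#">Chelsea</a></span>
</td></tr>
</table>
"""


def text_of(parent, css_class):
    """Return the stripped text of the first matching span, or None."""
    tag = parent.find('span', class_=css_class)
    return ''.join(tag.stripped_strings) if tag is not None else None


def extract_matches(soup):
    """Collect (home, score, away) tuples, skipping incomplete cells."""
    rows = []
    for match in soup.find_all('td', class_='match-details'):
        row = (text_of(match, 'team-home'),
               text_of(match, 'score'),
               text_of(match, 'team-away'))
        if all(row):
            rows.append(row)
    return rows


soup = BeautifulSoup(HTML, 'html.parser')
rows = extract_matches(soup)

# Write to an in-memory buffer here; open('webScraper.csv', 'w', newline='')
# could be used instead to produce a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['home', 'score', 'away'])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The same rows list could be handed to pandas.DataFrame(rows, columns=['home', 'score', 'away']) if a DataFrame is the end goal, as in the question.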

Answer 1 (score: 1)

You can use the tag.string attribute to get the tag's value.

See the documentation for details: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
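
One caveat worth illustrating: .string only returns text when a tag has a single child, and it is None otherwise. With the markup from the question, the whitespace around the nested <abbr> counts as extra children of the outer span. A small sketch using the score snippet from the question (Python 3, 'html.parser' assumed):

```python
from bs4 import BeautifulSoup

HTML = '<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>'

soup = BeautifulSoup(HTML, 'html.parser')
score_span = soup.find('span', class_='score')
abbr = score_span.find('abbr')

# The span has three children (whitespace, <abbr>, whitespace),
# so .string is None here.
span_string = score_span.string

# The <abbr> has exactly one text child, so .string returns it,
# surrounding spaces included.
abbr_string = abbr.string

# get_text(strip=True) gathers all nested text and strips each piece.
span_text = score_span.get_text(strip=True)

print(repr(span_string), repr(abbr_string), repr(span_text))
```

So tag.string works on the innermost element, while get_text() or stripped_strings is the safer choice on the outer span elements that the question's find_all call returns.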

Answer 2 (score: 0)

Because the page now redirects here: https://www.bbc.com/sport/football/premier-league/scores-fixtures

This is an update to the accepted answer, which is still correct. If you edit your answer, ping me and I will delete this one.

for match in soup.find_all('article', class_='sp-c-fixture'):
    home_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-home').find('span').find('span')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='sp-c-fixture__number sp-c-fixture__number--time')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-away').find('span').find('span')
    away = away_tag and ''.join(away_tag.stripped_strings)
    if home and score and away:
        print(home, score, away)
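
Note that find() returns None when nothing matches, so a chained .find('span').find('span') raises AttributeError before the "home_tag and ..." guard gets a chance to run. A more defensive variant might look like the sketch below. It is only an illustration: the HTML fragment is hypothetical (only the class names are taken from the code above, and the real page may be structured differently), and first_text is a made-up helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names come from the answer above.
# The second fixture deliberately has no score or away-team element.
HTML = """
<article class="sp-c-fixture">
  <span class="sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-home"><span><span>Newcastle</span></span></span>
  <span class="sp-c-fixture__number sp-c-fixture__number--time">0-3</span>
  <span class="sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-away"><span><span>Sunderland</span></span></span>
</article>
<article class="sp-c-fixture">
  <span class="sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-home"><span><span>Chelsea</span></span></span>
</article>
"""


def first_text(parent, css_class):
    """Return the stripped text of the first span carrying css_class, or None."""
    tag = parent.find('span', class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(HTML, 'html.parser')

results = []
for match in soup.find_all('article', class_='sp-c-fixture'):
    # Matching on a single class avoids depending on the exact order
    # of the full class attribute string.
    home = first_text(match, 'sp-c-fixture__team--time-home')
    score = first_text(match, 'sp-c-fixture__number--time')
    away = first_text(match, 'sp-c-fixture__team--time-away')
    if home and score and away:
        results.append((home, score, away))

print(results)
```

Searching for one distinguishing class at a time also keeps the lookup working if the site reorders or adds classes on the same element.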