Extracting data from Wikipedia using Beautiful Soup

Time: 2014-08-05 11:16:50

Tags: python beautifulsoup

I have a Wikipedia page: http://en.wikipedia.org/wiki/2014_AFL_season

What I need is to build a dictionary with the round as the key and the corresponding match data as its value.

Like this:

myDict = {
    "Round 1": [["Date", "Loser Team", "Winner Team", "Stadium", "Crowd"], ["Date", "Loser Team", "Winner Team", "Stadium", "Crowd"], ...],
    "Round 2": [["Date", "Loser Team", "Winner Team", "Stadium", "Crowd"], ["Date", "Loser Team", "Winner Team", "Stadium", "Crowd"], ...],
}

So this dictionary would store all of the data.

Please help me do this. I am using BS4 and urllib2 in Python.

I have used the following code:

from bs4 import BeautifulSoup
import urllib2

header = {'User-Agent': 'Mozilla/5.0'}

def createLink():
    url = "http://en.wikipedia.org/wiki/2014_AFL_season"
    # mainPage = urllib2.Request(url, headers=header)
    mainPage = urllib2.urlopen(url)
    mainPageSoup = BeautifulSoup(mainPage)
    for index in mainPageSoup.findAll("table"):
        print index

createLink()

1 Answer:

Answer 0 (score: 0)

Take advantage of the fact that each table is preceded by an H3 element holding the round name:

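# `soup` is the BeautifulSoup object for the page (mainPageSoup in the question)
rounds = {}

for table in soup.select('h3 + table'):
    # the heading directly before the table names the round
    round_name = table.find_previous_sibling('h3').span.get_text().strip()
    if not round_name.lower().startswith('round'):
        break  # all rounds found
    entries = []
    # rows carrying a style attribute are skipped; only plain rows are matches
    for row in table.find_all('tr', style=False):
        cells = row.find_all('td')
        if len(cells) < 5:
            continue
        date = cells[0].get_text()
        loser = cells[1].a.get_text()
        winner = cells[3].a.get_text()
        venue = cells[4].a.get_text()
        # the crowd figure is the text following the venue link
        crowd = cells[4].a.next_sibling.strip(' \n()')
        # collect every match instead of overwriting the round's value
        entries.append([date, loser, winner, venue, crowd])
    rounds[round_name] = entries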
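To try the approach above without hitting Wikipedia, here is a minimal Python 3 sketch that runs the same loop against a small inline HTML fixture. The fixture is an assumption: a simplified guess at how the page lays out its round tables, with made-up team names, venues, and numbers. The function name `parse_rounds` is hypothetical as well, and `html.parser` is chosen only so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fixture: a simplified guess at the page layout, not real data.
SAMPLE_HTML = """
<h3><span>Round 1</span></h3>
<table>
<tr style="background:#ccf"><th colspan="5">Round 1</th></tr>
<tr>
<td>Friday, 14 March</td>
<td><a href="#">Team A</a> 10.10 (70)</td>
<td>def. by</td>
<td><a href="#">Team B</a> 12.9 (81)</td>
<td><a href="#">Ground X</a> (crowd: 12,345)</td>
</tr>
<tr>
<td>Saturday, 15 March</td>
<td><a href="#">Team C</a> 8.8 (56)</td>
<td>def. by</td>
<td><a href="#">Team D</a> 9.9 (63)</td>
<td><a href="#">Ground Y</a> (crowd: 6,789)</td>
</tr>
</table>
<h3><span>Ladder</span></h3>
<table><tr><td>not a round</td></tr></table>
"""


def parse_rounds(html):
    """Return {round name: [[date, team 1, team 2, venue, crowd], ...]}."""
    soup = BeautifulSoup(html, "html.parser")
    rounds = {}
    for table in soup.select("h3 + table"):
        round_name = table.find_previous_sibling("h3").span.get_text().strip()
        if not round_name.lower().startswith("round"):
            break  # first non-round heading ends the scan
        entries = []
        # style=False keeps only rows that have no style attribute
        for row in table.find_all("tr", style=False):
            cells = row.find_all("td")
            if len(cells) < 5:
                continue
            venue_link = cells[4].a
            entries.append([
                cells[0].get_text().strip(),
                cells[1].a.get_text(),
                cells[3].a.get_text(),
                venue_link.get_text(),
                # text after the venue link, minus spaces and parentheses
                venue_link.next_sibling.strip(" \n()"),
            ])
        rounds[round_name] = entries
    return rounds


rounds = parse_rounds(SAMPLE_HTML)
for name, matches in rounds.items():
    print(name, matches)
```

With the real page, the same function could be fed the downloaded HTML instead of `SAMPLE_HTML`. The live markup may differ from this fixture, so the cell indexes are worth checking against the actual table.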