I have a Wikipedia page: http://en.wikipedia.org/wiki/2014_AFL_season
What I need is to build a dictionary with each round as the key and the corresponding match data as its value.
Like this:
myDict = {"Round 1": [["Date", "Loser Team", "Winner Team", "Stadium", "Crowd"], ["Date", "Loser Team", "Winner Team", "Stadium", "Crowd"], ...], "Round 2": [["Date", "Loser Team", "Winner Team", "Stadium", "Crowd"], ["Date", "Loser Team", "Winner Team", "Stadium", "Crowd"], ...]}
So this dictionary will store all the data.
Please help me do this. I am using BS4 and urllib2 in Python.
I have used the following code:
from bs4 import BeautifulSoup
import urllib2
header = {'User-Agent': 'Mozilla/5.0'}
def createLink():
    url = "http://en.wikipedia.org/wiki/2014_AFL_season"
    # mainPage = urllib2.Request(url, headers=header)
    mainPage = urllib2.urlopen(url)
    mainPageSoup = BeautifulSoup(mainPage)
    for index in mainPageSoup.findAll("table"):
        print index

createLink()
Answer 0 (score: 0):
Take advantage of the fact that each round table is preceded by an h3 heading element containing the round name:
rounds = {}
for table in soup.select('h3 + table'):
    # The round name sits in the <span> inside the preceding <h3> heading
    round_name = table.find_previous_sibling('h3').span.get_text().strip()
    if not round_name.lower().startswith('round'):
        break  # all rounds found

    entries = []
    for row in table.find_all('tr', style=False):
        cells = row.find_all('td')
        if len(cells) < 5:
            continue  # skip rows that do not hold a full match result
        date = cells[0].get_text()
        loser = cells[1].a.get_text()
        winner = cells[3].a.get_text()
        venue = cells[4].a.get_text()
        crowd = cells[4].a.next_sibling.strip(' \n()')
        entries.append([date, loser, winner, venue, crowd])

    rounds[round_name] = entries
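
For completeness, here is a minimal sketch of how the soup variable used above could be built and how the resulting dictionary could be inspected. It reuses the urllib2 setup from the question; the 'Mozilla/5.0' header, the variable names and the inspection lines are assumptions for illustration, not part of the original answer:

from bs4 import BeautifulSoup
import urllib2

url = "http://en.wikipedia.org/wiki/2014_AFL_season"
# A User-Agent header (as in the question) is a reasonable precaution when scraping.
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urllib2.urlopen(request))

# ... the round-scraping loop from the answer goes here, filling `rounds` ...

# Inspect the result (note: sorted() is lexicographic, so "Round 10" sorts before "Round 2")
for round_name in sorted(rounds):
    print round_name, '-', len(rounds[round_name]), 'matches'
print rounds['Round 1'][0]  # first match of Round 1: [date, loser, winner, venue, crowd]

This keeps the structure the question asked for: each key is a round name and each value is a list of per-match lists.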