I am trying to extract data from this site: http://www.afl.com.au/fixture
Somehow I want to have a dictionary that has the date as its key and the "preview" links as a list of values, like:
dict = {"Saturday, June 07": ["preview url-1", "preview url-2", "preview url-3", "preview url-4"]}
Please help me get this working; I have used the following code:
def extractData():
    lDateInfoMatchCase = False
    # lDateInfoMatchCase = []
    global gDict
    # First pass: collect the date header cells (th spanning the full row).
    for row in table_for_players.findAll("tr"):
        for lDateRowIndex in row.findAll("th", {"colspan": "4"}):
            ldateList.append(lDateRowIndex.text)
    print ldateList
    # Second pass: for every date, re-scan the table and collect the preview
    # links from the rows that follow that date header.
    for index in ldateList:
        lPreviewLinkList = []
        for row in table_for_players.findAll("tr"):
            for lDateRowIndex in row.findAll("th", {"colspan": "4"}):
                if lDateRowIndex.text == index:
                    lDateInfoMatchCase = True
                else:
                    lDateInfoMatchCase = False
            if lDateInfoMatchCase == True:
                for lInfoRowIndex in row.findAll("td", {"class": "info"}):
                    for link in lInfoRowIndex.findAll("a", {"class": "preview"}):
                        lPreviewLinkList.append("http://www.afl.com.au/" + link.get('href'))
        print lPreviewLinkList
        gDict[index] = lPreviewLinkList
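The snippet above assumes that table_for_players, ldateList and gDict already exist at module scope. A minimal setup sketch for completeness, assuming requests is used for the download and that the fixture table carries the class "fixture" (that class name is taken from the answer below, not from my code):

    import requests
    from bs4 import BeautifulSoup

    # Assumed setup for the snippet above: fetch the fixture page and grab
    # the fixture table. The 'table.fixture' selector matches the answer below.
    gDict = {}
    ldateList = []

    response = requests.get("http://www.afl.com.au/fixture")
    soup = BeautifulSoup(response.text)
    table_for_players = soup.select('table.fixture')[0]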
My main goal is to get the names of all the players playing for the home and away teams, organised by date, into a data structure.
Answer 0 (score: 0)
I prefer to use CSS selectors here. Select the first table, then all rows in its tbody for easier handling; the rows are "grouped" by tr th header rows. From there, you can select all following sibling rows that do not contain a th header and scan those for preview links:
previews = {}
table = soup.select('table.fixture')[0]

for group_header in table.select('tbody tr th'):
    date = group_header.string
    for next_sibling in group_header.parent.find_next_siblings('tr'):
        if next_sibling.th:
            # found the next group, end scan
            break
        for preview in next_sibling.select('a.preview'):
            previews.setdefault(date, []).append(
                "http://www.afl.com.au" + preview.get('href'))
This builds a dictionary of lists; for the current version of the page, this produces:
{u'Monday, June 09': ['http://www.afl.com.au/match-centre/2014/12/melb-v-coll'],
u'Sunday, June 08': ['http://www.afl.com.au/match-centre/2014/12/gcfc-v-syd',
'http://www.afl.com.au/match-centre/2014/12/fre-v-adel',
'http://www.afl.com.au/match-centre/2014/12/nmfc-v-rich']}
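To reach the original goal (player names for the home and away sides), each preview URL would then have to be fetched and parsed in turn. A rough sketch of that follow-up step, assuming requests is available; the 'li.player-item' selector is only a placeholder, because the markup of the preview pages is not shown anywhere in this post:

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical follow-up: visit each preview page and collect player names.
    # NOTE: 'li.player-item' is a placeholder selector, not taken from the real
    # preview pages; inspect one of the preview URLs to find the actual markup.
    players_by_date = {}
    for date, urls in previews.items():
        names = []
        for url in urls:
            page = BeautifulSoup(requests.get(url).text)
            for player in page.select('li.player-item'):
                names.append(player.get_text(strip=True))
        players_by_date[date] = names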