I'm new to Python and want to use BeautifulSoup to scrape the table at this URL: http://www.espn.com/college-sports/basketball/recruiting/databaseresults?firstname=&lastname=&class=2007&starsfilter=GT&stars=0&ratingfilter=GT&rating=&positionrank=&sportid=4294967265&collegeid=&conference=&visitmonth=&visityear=&statuscommit=Commitments&statusuncommit=Uncommited&honor=&region=&state=&height=&weight=
So far, I have figured out how to extract the table data for each player's row, as well as the link to the school logo in each row. However, I'm having trouble combining the two. I want to extract each player's table data (player_data in the code below) along with the corresponding school logo image link (logo_links), with one row per player in the saved CSV.
Here is what I have so far. Thanks in advance for any help.
#! python3
# downloadRecruits.py - Downloads espn college basketball recruiting database info
import requests, os, bs4, csv
import pandas as pd
# Starting url (class of 2007)
url = 'http://www.espn.com/college-sports/basketball/recruiting/databaseresults?firstname=&lastname=&class=2007&starsfilter=GT&stars=0&ratingfilter=GT&rating=&positionrank=&sportid=4294967265&collegeid=&conference=&visitmonth=&visityear=&statuscommit=Commitments&statusuncommit=Uncommited&honor=&region=&state=&height=&weight='
# Download the page
print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()
# Creating bs object
soup = bs4.BeautifulSoup(res.text, "html.parser")
# Get the data
data_rows = soup.findAll('tr')[1:]
player_data = [[td.getText() for td in data_rows[i].findAll('td')] for i in range(len(data_rows))]
logo_links = [a['href'] for div in soup.find_all("div", attrs={"class": "school-logo"}) for a in div.find_all('a')]
# Saving only player_data
# newline='' prevents blank rows in the CSV on Windows with Python 3
with open('recruits2.csv', 'w', newline='') as f_output:
csv_output = csv.writer(f_output)
csv_output.writerows(player_data)
Answer 0: (score: 3)
I would do it like this.
Reason 1: you don't have to search through the HTML twice.
Reason 2: following from reason 1, you don't have to run a second loop.
player_data = []
for tr in data_rows:
    tdata = []
    for td in tr.find_all('td'):  # iterate over the cells only, skipping stray text nodes
        tdata.append(td.getText())
        # If this cell contains the school-logo div, also grab its link
        if td.div and td.div['class'][0] == 'school-logo':
            tdata.append(td.div.a['href'])
    player_data.append(tdata)
A brief explanation: I avoided a list comprehension mainly because of the if block, which checks each td for a div with the desired class name and, if one is present, appends its link to the list of data already collected for that tr.
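As a minimal, self-contained illustration of this pattern (using a made-up two-row HTML snippet instead of the real ESPN page; the player names and logo URLs below are hypothetical), the inline school-logo check can be exercised like this:

```python
import bs4

# Hypothetical miniature version of the table markup, for illustration only
html = """<table>
<tr><th>Name</th><th>School</th></tr>
<tr><td>Player A</td><td><div class="school-logo"><a href="http://example.com/logoA.png">A</a></div>State U</td></tr>
<tr><td>Player B</td><td><div class="school-logo"><a href="http://example.com/logoB.png">B</a></div>Tech</td></tr>
</table>"""

soup = bs4.BeautifulSoup(html, "html.parser")
player_data = []
for tr in soup.find_all('tr')[1:]:       # skip the header row
    tdata = []
    for td in tr.find_all('td'):
        tdata.append(td.get_text())
        # If this cell holds the school-logo div, grab its link too
        if td.div and 'school-logo' in td.div.get('class', []):
            tdata.append(td.div.a['href'])
    player_data.append(tdata)

print(player_data)
```

Each row list ends up holding the cell texts plus the logo URL, so every player's data and logo link travel together in a single list, ready for csv.writerows.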
Answer 1: (score: 2)
To append the logo_links elements to each list in player_data, you can do the following:
>>> i = 0
>>> for p in player_data:
...     p.append(logo_links[i])
...     i += 1
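The same result can be had without a manual counter. A common variant (a sketch assuming the two lists are the same length and in the same row order; the sample rows and URLs below are made up) pairs the lists with zip:

```python
# Made-up sample data standing in for the scraped results
player_data = [['Player A', 'Guard'], ['Player B', 'Center']]
logo_links = ['http://example.com/a.png', 'http://example.com/b.png']

# zip pairs row i with link i, so no index bookkeeping is needed
combined = [row + [link] for row, link in zip(player_data, logo_links)]
print(combined)
```

Note that zip stops at the shorter list, so if a row is missing its logo the extra rows are silently dropped; check the lengths first if that matters.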