I am trying to extract the table data from this web page: http://www.espn.com/college-sports/basketball/recruiting/playerrankings/_/view/espnu100/sort/rank/class/2019. However, when I try to pull the table data from each row, I cannot get it for every row. One pattern I have noticed is that the table data is missing for rows that contain an image. Is there another way to scrape the data I want (position, hometown, grade, etc.), especially for the rows where a picture is present?
Current code below:
# (soup is a BeautifulSoup object built from the page HTML, e.g.
#  soup = BeautifulSoup(requests.get(url).text, 'html.parser'))

# We are unable to get the table data for rows whose player has a picture
rows = soup.find_all('tr')

# This prints the text of each table row
for row in rows:
    print(row.text)
I can already get the player names using the "name" div class, but I don't think I can use that approach for the data in the other columns.
# The name of the player is going to be our first column, so let's make a list of the names
names = soup.find_all('div', {'class': 'name'})

# Empty list to put our player names in
players = []
for person in names:
    # Remove the "Video | Scouts Report" text that is appended to each name
    players.append(person.text.replace("Video | Scouts Report", ""))

# The length of this list is 100, so the names were extracted correctly
len(players)
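One likely cause of "missing" rows is reading `row.text` for the whole row at once: an `<img>` contributes no text, and cells can run together. A minimal sketch of pulling each cell individually instead, using a hypothetical miniature of the table markup (an assumption for illustration, not ESPN's real HTML):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table layout (not the site's actual markup)
html = """
<table>
  <tr><th>RK</th><th>PLAYER</th><th>POS</th></tr>
  <tr>
    <td>1</td>
    <td><img src="headshot.jpg"/><div class="name">James Wiseman</div></td>
    <td>C</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows_text = []
for row in soup.find_all("tr"):
    # Pull each cell individually; an <img> contributes no text, so a row
    # with a picture still yields its other columns
    cells = row.find_all(["td", "th"])
    rows_text.append([cell.get_text(" ", strip=True) for cell in cells])

print(rows_text)
```

With per-cell extraction you get one list entry per column, so rows containing images are no longer lost in the flattened text.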
Answer 0 (score: 1)
I think a simpler approach is to read it directly into a pandas DataFrame with pd.read_html(), which immediately returns a list (of length 1 here) of all the tables at the URL:
import pandas as pd

url = r'http://www.espn.com/college-sports/basketball/recruiting/playerrankings/_/view/espnu100/sort/rank/class/2019'
dfs = pd.read_html(url, header=0)
dfs[0].head()
# RK PLAYER POS \
#0 1 James WisemanVideo | Scouts Report C
#1 2 Cole AnthonyVideo | Scouts Report PG
#2 3 Vernon Carey Jr.Video | Scouts Report C
#3 4 Isaiah StewartVideo | Scouts Report C
#4 5 Anthony EdwardsVideo | Scouts Report SG
#
# HOMETOWN HT WT STARS GRADE \
#0 Memphis, TNEast High School 7'0'' 230 NaN 97
#1 Briarwood, NYOak Hill Academy 6'3'' 185 NaN 97
#2 Southwest Ranches, FLNSU University School 6'10'' 275 NaN 97
#3 Rochester, NYLa Lumiere School 6'9'' 245 NaN 97
#4 Atlanta, GAHoly Spirit School 6'4'' 205 NaN 97
#
# SCHOOL
#0 MemphisSigned
#1 List
#2 DukeCommitted12/06/2018
#3 WashingtonCommitted01/20/2019
#4 GeorgiaCommitted02/11/2019
Of course, you will have to do some cleanup, but I think this is much more efficient than reading everything into lists.
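For example, the "Video | Scouts Report" suffix that read_html fuses onto each name can be stripped with vectorized string methods. A small sketch using a stand-in DataFrame (two values copied from the output above, rather than re-fetching the URL):

```python
import pandas as pd

# Small stand-in for dfs[0] with values copied from the output above
df = pd.DataFrame({
    "PLAYER": ["James WisemanVideo | Scouts Report",
               "Cole AnthonyVideo | Scouts Report"],
    "GRADE": [97, 97],
})

# Strip the "Video | Scouts Report" suffix fused onto each name;
# regex=False treats the "|" literally instead of as an alternation
df["PLAYER"] = (df["PLAYER"]
                .str.replace("Video | Scouts Report", "", regex=False)
                .str.strip())

print(df["PLAYER"].tolist())
```

The same pattern applies to the HOMETOWN column, where the city and the high-school name are fused together.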