Python - Scraping an ESPN table with BeautifulSoup

Time: 2016-12-21 02:36:22

Tags: python web-scraping beautifulsoup

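I am trying to use BeautifulSoup to scrape the "Season Stats" table on this page. Is there a way to get the whole table into a single soup object? Currently my code looks like this: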

seasonStats = soup.find('table', {'id': 'statsTable'})
categoryList = seasonStats.findAll('tr')[2].findAll('a')

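The problem I am running into is that FG%, FT%, 3PM, REB, AST, STL, BLK, TO, PTS are stored in one row, while RK, LAST, MOVES are stored in another. Is there any way to scrape the whole table properly, so that RK, TEAM, FG%, FT%, 3PM, REB, AST, STL, BLK, TO, PTS, LAST, MOVES all end up in one row (categoryList)? It seems silly that ESPN puts these values in separate rows at all. It would also be very helpful if I could get the whole table into a matrix.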

Desired output:

['RK', 'TEAM', 'FG%', 'FT%', '3PM', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS', 'LAST', 'MOVES']
['1', 'Team Li', '.4656', '.8049', '437', '1752', '962', '284', '228', '578', '4804', '4-4-1', '12']
['2', 'Team Aguilar', '.4499', '.7727', '415', '1925', '737', '276', '292', '543', '4901', '4-4-1', '0']
['3', 'Suck MyDirk', '.4533', '.8083', '410', '1798', '1035', '367', '153', '658', '5331', '3-6-0', '8']
['4', 'Knicks Tape', '.4589', '.8057', '339', '1458', '1029', '285', '132', '566', '4304', '4-5-0', '12']
['5', 'Kris Kaman  His Pants', '.4576', '.8068', '534', '1530', '940', '306', '115', '515', '4603', '5-4-0', '17']
['6', 'Nutz Vs. Draymond Green', '.4518', '.8000', '404', '1641', '1004', '270', '176', '620', '4554', '5-4-0', '12']
['7', 'Team Keyrouze', '.4548', '.7895', '441', '1558', '809', '293', '195', '510', '4874', '4-5-0', '11']
['8', 'In Porzingod We Trust', '.4607', '.7542', '275', '1699', '1020', '274', '252', '482', '4119', '6-3-0', '13']
['9', 'Team Iannetta', '.4706', '.7908', '260', '1800', '1026', '310', '137', '646', '4909', '8-1-0', '13']
['10', "Jesse's Blue Balls", '.4646', '.6766', '403', '2029', '505', '243', '238', '481', '3929', '5-4-0', '16']
['11', 'Team Pauls 2 da Wall', '.4531', '.7602', '313', '1797', '1197', '313', '268', '525', '3719', '6-3-0', '13']
['12', 'YOU REACH, I TEACH', '.4552', '.7591', '401', '1488', '997', '285', '108', '521', '3694', '4-5-0', '12']
['13', 'Team Noey', '.4740', '.7610', '273', '1821', '681', '301', '226', '491', '4059', '3-6-0', '9']
['14', 'Team Jackson', '.4325', '.7484', '206', '1104', '714', '174', '101', '383', '2532', '1-8-0', '4']

Current output:

['1', 'Team Li', '.4656', '.8049', '437', '1752', '962', '284', '228', '578', '4804', '4-4-1', '12']
['2', 'Team Aguilar', '.4499', '.7727', '415', '1925', '737', '276', '292', '543', '4901', '4-4-1', '0']
['3', 'Suck MyDirk', '.4533', '.8083', '410', '1798', '1035', '367', '153', '658', '5331', '3-6-0', '8']
['4', 'Knicks Tape', '.4589', '.8057', '339', '1458', '1029', '285', '132', '566', '4304', '4-5-0', '12']
['5', 'Kris Kaman  His Pants', '.4576', '.8068', '534', '1530', '940', '306', '115', '515', '4603', '5-4-0', '17']
['6', 'Nutz Vs. Draymond Green', '.4518', '.8000', '404', '1641', '1004', '270', '176', '620', '4554', '5-4-0', '12']
['7', 'Team Keyrouze', '.4548', '.7895', '441', '1558', '809', '293', '195', '510', '4874', '4-5-0', '11']
['8', 'In Porzingod We Trust', '.4607', '.7542', '275', '1699', '1020', '274', '252', '482', '4119', '6-3-0', '13']
['9', 'Team Iannetta', '.4706', '.7908', '260', '1800', '1026', '310', '137', '646', '4909', '8-1-0', '13']
['10', "Jesse's Blue Balls", '.4646', '.6766', '403', '2029', '505', '243', '238', '481', '3929', '5-4-0', '16']
['11', 'Team Pauls 2 da Wall', '.4531', '.7602', '313', '1797', '1197', '313', '268', '525', '3719', '6-3-0', '13']
['12', 'YOU REACH, I TEACH', '.4552', '.7591', '401', '1488', '997', '285', '108', '521', '3694', '4-5-0', '12']
['13', 'Team Noey', '.4740', '.7610', '273', '1821', '681', '301', '226', '491', '4059', '3-6-0', '9']
['14', 'Team Jackson', '.4325', '.7484', '206', '1104', '714', '174', '101', '383', '2532', '1-8-0', '4']

Thanks a lot.

2 answers:

Answer 0 (score: 1)

import requests
import bs4

url = 'http://games.espn.com/fba/standings?leagueId=224165&seasonId=2017'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'lxml')

# Grab the stats table, then both header rows and every team row in document order
table = soup.find(id="statsTable")
rows = table.find_all(class_=["tableBody sortableRow", "tableSubHead"])

# The first two rows are the split header; skip empty spacer cells
rows = iter(rows)
header_1 = [td.text for td in next(rows).find_all('td') if td.text]
header_2 = [td.text for td in next(rows).find_all('td') if td.text]
# Splice the stat names between RK, TEAM and LAST, MOVES
header = header_1[:2] + header_2 + header_1[-2:]
print(header)
# Every remaining row holds one team's data
for row in rows:
    data = [td.text for td in row.find_all('td') if td.text]
    print(data)

Output:

['RK', 'TEAM', 'FG%', 'FT%', '3PM', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS', 'LAST', 'MOVES']
['1', 'Team Li', '.4656', '.8049', '437', '1752', '962', '284', '228', '578', '4804', '4-4-1', '12']
['2', 'Team Aguilar', '.4499', '.7727', '415', '1925', '737', '276', '292', '543', '4901', '4-4-1', '0']
['3', 'Suck MyDirk', '.4533', '.8083', '410', '1798', '1035', '367', '153', '658', '5331', '3-6-0', '8']
['4', 'Knicks Tape', '.4589', '.8057', '339', '1458', '1029', '285', '132', '566', '4304', '4-5-0', '12']
['5', 'Kris Kaman  His Pants', '.4576', '.8068', '534', '1530', '940', '306', '115', '515', '4603', '5-4-0', '17']
['6', 'Nutz Vs. Draymond Green', '.4518', '.8000', '404', '1641', '1004', '270', '176', '620', '4554', '5-4-0', '12']
['7', 'Team Keyrouze', '.4548', '.7895', '441', '1558', '809', '293', '195', '510', '4874', '4-5-0', '11']
['8', 'In Porzingod We Trust', '.4607', '.7542', '275', '1699', '1020', '274', '252', '482', '4119', '6-3-0', '13']
['9', 'Team Iannetta', '.4706', '.7908', '260', '1800', '1026', '310', '137', '646', '4909', '8-1-0', '13']
['10', "Jesse's Blue Balls", '.4646', '.6766', '403', '2029', '505', '243', '238', '481', '3929', '5-4-0', '17']
['11', 'Team Pauls 2 da Wall', '.4531', '.7602', '313', '1797', '1197', '313', '268', '525', '3719', '6-3-0', '13']
['12', 'YOU REACH, I TEACH', '.4552', '.7591', '401', '1488', '997', '285', '108', '521', '3694', '4-5-0', '12']
['13', 'Team Noey', '.4740', '.7610', '273', '1821', '681', '301', '226', '491', '4059', '3-6-0', '9']
['14', 'Team Jackson', '.4325', '.7484', '206', '1104', '714', '174', '101', '383', '2532', '1-8-0', '4']
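
Since the question also asks for a matrix, here is a rough sketch of how the header splice above could feed one. It uses no network access. The contents of `header_1` and `header_2` are assumptions about what the two header rows yield after empty cells are dropped, and `merge_headers` is a hypothetical helper name, not part of any library. The two data rows are copied from the output above.

```python
def merge_headers(outer, inner):
    """Place the inner header cells between the first two and last two outer cells."""
    return outer[:2] + inner + outer[-2:]

# Assumed cell texts of the two header rows (not fetched from the page)
header_1 = ['RK', 'TEAM', 'LAST', 'MOVES']
header_2 = ['FG%', 'FT%', '3PM', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS']
header = merge_headers(header_1, header_2)

# Two team rows copied from the output above
rows = [
    ['1', 'Team Li', '.4656', '.8049', '437', '1752', '962', '284', '228', '578', '4804', '4-4-1', '12'],
    ['2', 'Team Aguilar', '.4499', '.7727', '415', '1925', '737', '276', '292', '543', '4901', '4-4-1', '0'],
]

# "Matrix" form: a list of lists with the header as the first row
matrix = [header] + rows

# Alternative form: one dict per team, keyed by column name
records = [dict(zip(header, row)) for row in rows]

for line in matrix:
    print(line)
print(records[0]['TEAM'], records[0]['PTS'])
```

The list-of-lists form matches the desired output in the question; the dict form may be handier if you want to look values up by column name.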

Answer 1 (score: 0)

I think you are mistaken. All the data for one team appears to be in the same tr. Here is the first one, with all styling removed:

<tr>
  <td id="sovrRk_9">1</td>
  <td><a title="Team Li (Royce Li)" href="...">Team Li</a></td>
  <td><spacer type="block" width="1" height="1"> </spacer>
  </td>
  <td id="tmTotalStat_9_19">.4656</td>
  <td id="tmTotalStat_9_20">.8049</td>
  <td id="tmTotalStat_9_17">437</td>
  <td id="tmTotalStat_9_6">1752</td>
  <td id="tmTotalStat_9_3">962</td>
  <td id="tmTotalStat_9_2">284</td>
  <td id="tmTotalStat_9_1">228</td>
  <td id="tmTotalStat_9_11">578</td>
  <td id="tmTotalStat_9_0">4804</td>
  <td>4-4-1</td>
  <td  title="Season Moves">12</td>
</tr>

Everything is there.
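
To check this without hitting the site, the row above can be parsed from a string. This is only a sketch: the `<table>` wrapper is added for illustration, and the built-in `html.parser` is used so that lxml is not needed. In this stripped-down snippet the spacer cell contains whitespace, so the text is stripped before the empty cell is dropped.

```python
from bs4 import BeautifulSoup

# The row shown above, wrapped in a <table> for illustration
html = """
<table>
<tr>
  <td id="sovrRk_9">1</td>
  <td><a title="Team Li (Royce Li)" href="...">Team Li</a></td>
  <td><spacer type="block" width="1" height="1"> </spacer>
  </td>
  <td id="tmTotalStat_9_19">.4656</td>
  <td id="tmTotalStat_9_20">.8049</td>
  <td id="tmTotalStat_9_17">437</td>
  <td id="tmTotalStat_9_6">1752</td>
  <td id="tmTotalStat_9_3">962</td>
  <td id="tmTotalStat_9_2">284</td>
  <td id="tmTotalStat_9_1">228</td>
  <td id="tmTotalStat_9_11">578</td>
  <td id="tmTotalStat_9_0">4804</td>
  <td>4-4-1</td>
  <td  title="Season Moves">12</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

# Strip whitespace from every cell, then drop the cells that end up empty
cells = [td.get_text(strip=True) for td in row.find_all("td")]
data = [text for text in cells if text]
print(data)
```

So a single `find_all('td')` on the tr is enough for the team rows; only the column names need the extra handling discussed in the question.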