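I am using BeautifulSoup for web scraping. I am trying to collect data from ESPN and save it to a file. Eventually I want to parse the data so that each statistic is stored per player. However, roughly every 10 players, three extra tags appear, which makes it hard to parse the data correctly. When using find_all, I tried selecting 'td' and keeping only the elements with align="right" (that is the data I want, and it would drop the three extra tags). But when I add that filter to the call, it does not work.
This is the website: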
url_base = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/" + str(yr) + '/count/'
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bS

def player_count(page_nums):
    # Build the list of '/count/' offsets: '', 41, 81, ...
    player_list = ['']
    num_start = 1
    for i in range(1, page_nums + 1):
        num_start += 40
        player_list.append(num_start)
    return player_list

def num_pages(num_strings):
    # The third whitespace-separated token holds the page count
    list_ = num_strings.split(" ")
    return list_[2]

year = []
for i in range(2001, 2019):
    year.append(i)

for yr in year:
    if yr == 2005:
        continue
    url_base = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/" + str(yr) + '/count/'
    for i in player_count_list:
        url = url_base + str(i)
        print(url)
        with urlopen(url) as page:
            html_doc = bS(page, 'html5lib')
        table_string = html_doc.find_all('td', {'align': 'right'})
        print(table_string)
        new_strings = []
        for td in table_string:
            new_strings.append(td.text)
        yr_file.append(new_strings)
After reading through the documentation, it seems I should be able to use:
html_doc.find_all('td', {'align': 'right'})
But I have not been able to get it to work.
Answer 0 (score: 0)
I checked your URL: it is not the <td> tags but the <tr> rows with align=right that contain the useful data. To extract them, you can use the following snippet:
yr_table = []
for i in player_count_list:
    url = url_base + str(i)
    print(url)
    with urlopen(url) as page:
        html_doc = bS(page, 'html5lib')
    # The align attribute sits on the rows, so select <tr>, not <td>
    for tr in html_doc.select('tr[align="right"]'):
        new_strings = []
        for td in tr.find_all('td'):
            new_strings.append(td.text.strip())
        yr_table.append(new_strings)
pprint(yr_table, width=180) # from pprint import pprint
This prints, for example:
http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2007/count/0
[['RK', 'PLAYER', 'TEAM', 'GP', 'G', 'A', 'PTS', '+/-', 'PIM', 'PTS/G', 'SOG', 'PCT', 'GWG', 'G', 'A', 'G', 'A'],
['1', 'Sidney Crosby, C', 'PIT', '79', '36', '84', '120', '10', '60', '1.52', '250', '14.4', '4', '13', '48', '0', '0'],
['2', 'Joe Thornton, C', 'SJ', '82', '22', '92', '114', '24', '44', '1.39', '213', '10.3', '5', '10', '44', '0', '0'],
['3', 'Vincent Lecavalier, C', 'TB', '82', '52', '56', '108', '2', '44', '1.32', '339', '15.3', '7', '16', '20', '5', '4'],
['4', 'Dany Heatley, LW', 'OTT', '82', '50', '55', '105', '31', '74', '1.28', '310', '16.1', '10', '17', '22', '3', '1'],
['5', 'Martin St. Louis, RW', 'TB', '82', '43', '59', '102', '7', '28', '1.24', '273', '15.8', '7', '14', '16', '5', '6'],
['6', 'Marian Hossa, RW', 'ATL', '82', '43', '57', '100', '18', '49', '1.22', '340', '12.6', '5', '17', '27', '3', '1'],
['', 'Joe Sakic, C', 'COL', '82', '36', '64', '100', '2', '46', '1.22', '258', '14.0', '4', '16', '27', '0', '0'],
['8', 'Jaromir Jagr, RW', 'NYR', '82', '30', '66', '96', '26', '78', '1.17', '324', '9.3', '5', '7', '34', '0', '0'],
['', 'Marc Savard, C', 'BOS', '82', '22', '74', '96', '-19', '96', '1.17', '221', '10.0', '3', '10', '39', '1', '0'],
['10', 'Daniel Briere, C', 'BUF', '81', '32', '63', '95', '17', '89', '1.17', '234', '13.7', '6', '9', '21', '0', '0'],
['RK', 'PLAYER', 'TEAM', 'GP', 'G', 'A', 'PTS', '+/-', 'PIM', 'PTS/G', 'SOG', 'PCT', 'GWG', 'G', 'A', 'G', 'A'],
['11', 'Teemu Selanne, RW', 'ANH', '82', '48', '46', '94', '26', '82', '1.15', '257', '18.7', '10', '25', '23', '0', '0'],
['', 'Jarome Iginla, RW', 'CGY', '70', '39', '55', '94', '12', '40', '1.34', '264', '14.8', '7', '13', '20', '1', '0'],
['13', 'Alex Ovechkin, LW', 'WAS', '82', '46', '46', '92', '-19', '52', '1.12', '392', '11.7', '8', '16', '21', '0', '0'],
['14', 'Olli Jokinen, C', 'FLA', '82', '39', '52', '91', '18', '78', '1.11', '351', '11.1', '8', '9', '19', '1', '0'],
['15', 'Jason Spezza, C', 'OTT', '67', '34', '53', '87', '19', '45', '1.30', '162', '21.0', '5', '13', '20', '1', '1'],
['', 'Daniel Alfredsson, RW', 'OTT', '77', '29', '58', '87', '42', '42', '1.13', '240', '12.1', '7', '7', '18', '2', '2'],
['', 'Pavel Datsyuk, C', 'DET', '79', '27', '60', '87', '36', '20', '1.10', '207', '13.0', '5', '5', '24', '2', '1'],
['18', 'Evgeni Malkin, C', 'PIT', '78', '33', '52', '85', '2', '80', '1.09', '242', '13.6', '6', '16', '24', '0', '0'],
['19', 'Thomas Vanek, LW', 'BUF', '82', '43', '41', '84', '47', '40', '1.02', '237', '18.1', '5', '15', '7', '0', '0'],
['', 'Daniel Sedin, LW', 'VAN', '81', '36', '48', '84', '19', '36', '1.04', '236', '15.3', '8', '16', '18', '0', '0'],
['RK', 'PLAYER', 'TEAM', 'GP', 'G', 'A', 'PTS', '+/-', 'PIM', 'PTS/G', 'SOG', 'PCT', 'GWG', 'G', 'A', 'G', 'A'],
['21', 'Ray Whitney, LW', 'CAR', '81', '32', '51', '83', '-5', '46', '1.02', '215', '14.9', '6', '6', '24', '0', '0'],
['', 'Andrew Brunette, LW', 'COL', '82', '27', '56', '83', '-8', '36', '1.01', '173', '15.6', '2', '9', '27', '0', '0'],
['', 'Michael Nylander, C', 'NYR', '79', '26', '57', '83', '12', '42', '1.05', '193', '13.5', '4', '14', '23', '0', '0'],
['24', "Rod Brind'Amour, C", 'CAR', '78', '26', '56', '82', '7', '46', '1.05', '181', '14.4', '5', '9', '21', '2', '3'],
['25', 'Alex Tanguay, LW', 'CGY', '81', '22', '59', '81', '12', '44', '1.00', '107', '20.6', '0', '5', '16', '0', '0'],
['', 'Henrik Sedin, C', 'VAN', '82', '10', '71', '81', '19', '66', '0.99', '134', '7.5', '2', '1', '34', '0', '0'],
['27', 'Michael Cammalleri, LW', 'LA', '81', '34', '46', '80', '5', '48', '0.99', '299', '11.4', '5', '16', '21', '0', '0'],
['', 'Slava Kozlov, LW', 'ATL', '81', '28', '52', '80', '9', '36', '0.99', '190', '14.7', '8', '8', '29', '0', '2'],
['29', 'Patrick Marleau, C', 'SJ', '77', '32', '46', '78', '9', '33', '1.01', '180', '17.8', '9', '14', '23', '0', '0'],
['', 'Paul Stastny, C', 'COL', '82', '28', '50', '78', '4', '42', '0.95', '185', '15.1', '6', '11', '20', '0', '1'],
['RK', 'PLAYER', 'TEAM', 'GP', 'G', 'A', 'PTS', '+/-', 'PIM', 'PTS/G', 'SOG', 'PCT', 'GWG', 'G', 'A', 'G', 'A'],
['', 'Andy McDonald, C', 'ANH', '82', '27', '51', '78', '16', '46', '0.95', '252', '10.7', '3', '8', '25', '0', '0'],
['32', 'Kristian Huselius, LW', 'CGY', '81', '34', '43', '77', '21', '26', '0.95', '173', '19.7', '6', '14', '20', '2', '2'],
['', 'Daymond Langkow, C', 'CGY', '81', '33', '44', '77', '23', '44', '0.95', '247', '13.4', '6', '10', '17', '1', '2'],
['34', 'Ilya Kovalchuk, LW', 'ATL', '82', '42', '34', '76', '-2', '66', '0.93', '336', '12.5', '7', '18', '14', '0', '0'],
['', 'Mats Sundin, C', 'TOR', '75', '27', '49', '76', '-2', '62', '1.01', '321', '8.4', '3', '6', '28', '1', '0'],
['', 'Paul Kariya, LW', 'NSH', '82', '24', '52', '76', '6', '36', '0.93', '224', '10.7', '2', '5', '20', '0', '0'],
['37', 'Saku Koivu, C', 'MON', '81', '22', '53', '75', '-21', '74', '0.93', '154', '14.3', '4', '11', '32', '1', '2'],
['38', 'Alexander Semin, RW', 'WAS', '77', '38', '35', '73', '-7', '90', '0.95', '243', '15.6', '6', '17', '21', '0', '0'],
['39', 'Alexander Frolov, LW', 'LA', '82', '35', '36', '71', '-8', '34', '0.87', '195', '17.9', '6', '10', '18', '1', '2']]
... and so on
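As the output shows, the header row ('RK', 'PLAYER', ...) is repeated every ten players, tied players have an empty rank, and the column names 'G' and 'A' occur more than once. If you want one record per player, a possible post-processing step is sketched below. It is only a sketch: the helper names unique_names and rows_to_records are made up for illustration, the '_2' / '_3' suffixes are an arbitrary way to keep the duplicate column names apart, and it assumes the rows have the same shape as yr_table above.

```python
def unique_names(names):
    """Make column names unique by suffixing repeats: G, G_2, G_3, ..."""
    seen = {}
    out = []
    for name in names:
        seen[name] = seen.get(name, 0) + 1
        out.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return out


def rows_to_records(rows):
    """Turn scraped rows into one dict per player, skipping repeated headers."""
    records = []
    header = None
    last_rank = None
    for row in rows:
        if row and row[0] == "RK":
            # Header row: remember the first one, skip every repeat.
            if header is None:
                header = unique_names(row)
            continue
        if header is None or len(row) != len(header):
            continue  # not a data row matching the header layout
        record = dict(zip(header, row))
        if record["RK"] == "":
            # Tied players have no rank of their own; reuse the previous one.
            record["RK"] = last_rank
        else:
            last_rank = record["RK"]
        records.append(record)
    return records


# A few rows copied from the output above.
HEADER = ['RK', 'PLAYER', 'TEAM', 'GP', 'G', 'A', 'PTS', '+/-', 'PIM',
          'PTS/G', 'SOG', 'PCT', 'GWG', 'G', 'A', 'G', 'A']
sample = [
    HEADER,
    ['6', 'Marian Hossa, RW', 'ATL', '82', '43', '57', '100', '18', '49',
     '1.22', '340', '12.6', '5', '17', '27', '3', '1'],
    ['', 'Joe Sakic, C', 'COL', '82', '36', '64', '100', '2', '46',
     '1.22', '258', '14.0', '4', '16', '27', '0', '0'],
    HEADER,
    ['11', 'Teemu Selanne, RW', 'ANH', '82', '48', '46', '94', '26', '82',
     '1.15', '257', '18.7', '10', '25', '23', '0', '0'],
]

records = rows_to_records(sample)
for r in records:
    print(r["RK"], r["PLAYER"], r["PTS"])
```

In your script you would call rows_to_records(yr_table) once all pages for a year have been collected, and then write the resulting dicts to your file.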