I need help looping through table rows and putting them into lists. On this website there are three tables, each with different statistics - http://www.fangraphs.com/statsplits.aspx?playerid=15640&position=OF&season=0&split=0.4
For example, the three tables contain rows for 2016, 2017, and a Total row. I would like the following:
First list  -> Table 1 - Row 1, Table 2 - Row 1, Table 3 - Row 1
Second list -> Table 1 - Row 2, Table 2 - Row 2, Table 3 - Row 2
Third list  -> Table 1 - Row 3, Table 2 - Row 3, Table 3 - Row 3
I know I obviously need to create the lists, and I need to use the append function; however, I don't know how to get it to loop through the first row of each table, then the second row of each table, and so on through every row of the tables (the number of rows will vary in each instance - this one just happens to have 3).
Any help is greatly appreciated. The code is below:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

idList2 = ['15640', '9256']
splitList = [0.4, 0.2, 0.3, 0.4]

for id in idList2:
    pos = 'OF'
    for split in splitList:
        # Build the stat-splits URL for this player id, position and split
        url = ('http://www.fangraphs.com/statsplits.aspx?playerid=' + str(id) +
               '&position=' + str(pos) + '&season=0&split=' + str(split))
        r = requests.get(url)
        for season in range(1, 4):
            print(season)
            soup = BeautifulSoup(r.text, "html.parser")
            # Each of the three tables on the page has the id SeasonSplits1_dgSeason<N>_ctl00
            tableStats = soup.find("table", {"id": "SeasonSplits1_dgSeason" + str(season) + "_ctl00"})
            column_headers = [th.getText() for th in soup.findAll('th')]
            statistics = soup.find("table", {"id": "SeasonSplits1_dgSeason" + str(season) + "_ctl00"})
            tabledata = [td.getText() for td in statistics('td')]
            print(tabledata)
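To make the desired grouping concrete, here is a minimal sketch, assuming the three tables have already been parsed into lists of rows (the toy tables below are placeholders, not real FanGraphs data):

# Hypothetical example: each table is already a list of rows (one list per table).
table1 = [['2016', 'a'], ['2017', 'b'], ['Total', 'c']]
table2 = [['2016', 'd'], ['2017', 'e'], ['Total', 'f']]
table3 = [['2016', 'g'], ['2017', 'h'], ['Total', 'i']]

# zip() pairs up row 1 of each table, then row 2, and so on,
# yielding exactly one combined list per row position.
row_lists = [list(rows) for rows in zip(table1, table2, table3)]

# row_lists[0] -> [table1 row 1, table2 row 1, table3 row 1]
# row_lists[1] -> [table1 row 2, table2 row 2, table3 row 2]
# row_lists[2] -> [table1 row 3, table2 row 3, table3 row 3]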
Answer 0 (score: 1)
This will be my last attempt. It has everything you should need. I created a trail back to where the tables, rows and columns get scraped; it all happens in the function extract_table(). Follow the trail markers and don't worry about any other code. Don't let the large file size worry you - it is mostly documentation and spacing.
### ... ###
### START HERE ###

from bs4 import BeautifulSoup as Soup
import requests
import urllib

###### GO TO THE main() FUNCTION BELOW ######


### IGNORE ###
def generate_urls (idList, splitList):
    """ Using an id list and a split list, generate a list of urls """
    urls = []
    url = 'http://www.fangraphs.com/statsplits.aspx'

    for id in idList:
        for split in splitList:
            # The parameters used in creating the url
            url_payload = {'split': split, 'playerid': id, 'position': 'OF', 'season': 0}
            # Create the url and add it to the collection of urls
            urls += ['?'.join([url, urllib.urlencode(url_payload)])]

    return urls  # Return the list of urls


### IGNORE ###
def extract_player_name (soup):
    """ Extract the player name from the browser title """
    # Browser title contains the player name, strip all but the name
    player_name = repr(soup.title.text.strip('\r\n\t'))
    player_name = player_name.split(' \\xbb')[0]  # Split on ` »`
    player_name = player_name[2:]  # Erase the leading characters added by `repr`
    return player_name


########## FINISH HERE ##########
def extract_table (table_id, soup):
    """ Extract data from a table, return the column headers and the table rows """
    ### IMPORTANT: THIS CODE IS WHERE ALL THE MAGIC HAPPENS ###
    #
    # - First:  Find the lowest-level tag that contains all the data we want (the table).
    #
    # - Second: Extract the table column headers; this requires minimal mining.
    #
    # - Third:  Gather a list of tags that represent the table's rows.
    #
    # - Fourth: Loop through the list of rows.
    #     A) Mine all columns in the row.

    ### IMPORTANT: Get A Reference To The Table ###
    # SCRAPE 1:
    table_tag = soup.find("table", {"id": 'SeasonSplits1_dgSeason%d_ctl00' % table_id})

    # SCRAPE 2:
    columns = [th.text for th in table_tag.findAll('th')]

    # SCRAPE 3:
    rows_tags = table_tag.tbody.findAll('tr')  # All 'tr' tags in the table's `tbody` tag are row tags

    ### IMPORTANT: Cycle Through Rows And Collect Column Data ###
    # SCRAPE 4:
    rows = []  # List of all table rows
    for row_tag in rows_tags:
        ### IMPORTANT: Mine All Columns In This Row || LOWEST LEVEL IN THE MINING OPERATION ###
        # SCRAPE 4.A
        row = [col.text for col in row_tag.findAll('td')]  # `td` represents a column in a row
        rows.append(row)  # Add this row to all the other rows of this table

    # RETURN: The column headers and the rows of this table
    return [columns, rows]


### Look Deeper ###
def extract_player (soup):
    """ Extract player data and store it in a list: ['name', [columns, rows], [table2], ...] """
    player = []  # A list to store the data in

    # The player name is first in the player list
    player.append(extract_player_name(soup))

    # Each table is a list entry
    for season in range(1, 4):
        ### IMPORTANT: No table-related data has been mined yet. START HERE ###
        ### - See: extract_table() above
        table = extract_table(season, soup)  # `season` represents the table id
        player.append(table)  # Add this table (a list) to the player data list

    # Return the player list
    return player


##################################################
################## START HERE ####################
##################################################
###
###  OBJECTIVE:
###
###  - Follow the trail of important lines that extract the data
###  - Important lines are marked like this: `### ... ###`
###
###  All this code really needs is a url and the `extract_table()` function.
###
###  The `main()` function is where the journey starts.
###
##################################################
##################################################
def main ():
    """ The main function is the core program code. """
    # Luckily the pages we will scrape all have the same layout, which makes mining easier.
    all_players = []  # A place to store all the data

    # Values used to alter the url when making requests for more player statistics
    idList2 = ['15640', '9256']
    splitList = [0.4, 0.2, 0.3, 0.4]

    # Instead of looping through variables that don't tell a story,
    # let's create a list of urls generated from those variables.
    # This way the code is self-explanatory and human-readable.
    urls = generate_urls(idList2, splitList)  # The creation of the url is not important right now

    # Let's scrape each url
    for url in urls:
        print url

        # First step: get a web page via an http request.
        response = requests.get(url)

        # Second step: use a parsing library to create a parsable object.
        soup = Soup(response.text, "html.parser")  # Create a soup object (once)

        ### IMPORTANT: Parsing Starts and Ends Here ###
        ### - See: extract_player() above
        # Final step: given a soup object, mine the player data.
        player = extract_player(soup)

        # Add the new entry to the list
        all_players += [player]

    return all_players


# If this script is being run, not imported, run the `main()` function.
if __name__ == '__main__':
    all_players = main()

    print all_players[0][0]           # Player List -> Name
    print all_players[0][1]           # Player List -> Table 1
    print all_players[0][2]           # Player List -> Table 2
    print all_players[0][3]           # Player List -> Table 3
    print all_players[0][3][0]        # Player List -> Table 3 -> Columns
    print all_players[0][3][1]        # Player List -> Table 3 -> All Rows
    print all_players[0][3][1][0]     # Player List -> Table 3 -> All Rows -> Row 1
    print all_players[0][3][1][2]     # Player List -> Table 3 -> All Rows -> Row 3
    print all_players[0][3][1][2][0]  # Player List -> Table 3 -> All Rows -> Row 3 -> Column 1
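Not part of the original answer, but if the goal is still the row-wise lists from the question (row 1 of each table together, then row 2 of each table together, and so on), a minimal sketch that builds on the player lists returned by main() above could zip the three tables' row lists:

# Sketch: regroup one player's tables row-wise.
# `player` is one entry of all_players: ['name', [cols, rows], [cols, rows], [cols, rows]]
def rows_side_by_side (player):
    """ Return a list where entry i holds row i from each of the player's tables. """
    tables_rows = [table[1] for table in player[1:]]  # Just the row lists, skipping the name
    # zip() stops at the shortest table, so uneven tables are truncated rather than padded.
    return [list(combined) for combined in zip(*tables_rows)]

# Example usage:
# grouped = rows_side_by_side(all_players[0])
# grouped[0] -> [table 1 row 1, table 2 row 1, table 3 row 1]
# grouped[1] -> [table 1 row 2, table 2 row 2, table 3 row 2]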
Answer 1 (score: 0)
I have updated the code with separated functionality and (as requested) lists instead of dictionaries. Everything after the scraping loop is output testing (it can be ignored).
I see now that you make multiple requests (4) for the same player to gather more data. In the last answer I provided, the code only kept the data from the last request; using lists eliminates that problem. You may want to compress the lists so there is only one entry per player (see the sketch at the end of this answer).
The core of the program is the short scraping loop that fills all_players; everything above it is functions that handle the scraping.

from bs4 import BeautifulSoup as Soup
import requests
def http_get (id, split):
    """ Make a get request, return the response. """
    # Create the url parameters dictionary
    payload = {'split': split, 'playerid': id, 'position': 'OF', 'season': 0}
    url = 'http://www.fangraphs.com/statsplits.aspx'
    return requests.get(url, params=payload)  # Pass the payload through `requests.get()`


def extract_player_name (soup):
    """ Extract the player name from the browser title """
    # Browser title contains the player name, strip all but the name
    player_name = repr(soup.title.text.strip('\r\n\t'))
    player_name = player_name.split(' \\xbb')[0]  # Split on ` »`
    player_name = player_name[2:]  # Erase the leading characters added by `repr`
    return player_name


def extract_table (table_id, soup):
    """ Extract data from a table, return the column headers and the table rows """
    # SCRAPE: Get a table
    table_tag = soup.find("table", {"id": 'SeasonSplits1_dgSeason%d_ctl00' % table_id})

    # SCRAPE: Extract the table column headers
    columns = [th.text for th in table_tag.findAll('th')]

    rows = []
    # SCRAPE: Extract the table contents
    for row in table_tag.tbody.findAll('tr'):
        rows.append([col.text for col in row.findAll('td')])  # Gather all columns in the row

    # RETURN: [columns, rows]
    return [columns, rows]


def extract_player (soup):
    """ Extract player data and store it in a list: ['name', [columns, rows], [table2], ...] """
    player = []

    # The player name is first in the player list
    player.append(extract_player_name(soup))

    # Each table is a list entry
    for season in range(1, 4):
        player.append(extract_table(season, soup))

    # Return the player list
    return player


# A list of all players
all_players = [
    #'playername',
    #[table_columns, table_rows],
    #[table_columns, table_rows],
    #[['Season', 'vs R as R'], [['2015', 'yes'], ['2016', 'no'], ['2017', 'no'],]],
]

# I don't know what these values are. Sorry!
idList2 = ['15640', '9256']
splitList = [0.4, 0.2, 0.3, 0.4]

# Scrape data
for id in idList2:
    for split in splitList:
        response = http_get(id, split)
        soup = Soup(response.text, "html.parser")  # Create a soup object (once)
        all_players.append(extract_player(soup))
        # or: all_players += [extract_player(soup)]


# Output data
def PrintPlayerAsTable (player, show_name=True):
    if show_name: print player[0]  # The first entry is the player name
    for table in player[1:]:  # All other entries are tables
        PrintTableAsTable(table)

def PrintTableAsTable (table, table_sep='\n'):
    print table_sep
    PrintRowAsTable(table[0])  # The first item in the table is the list of column headers
    for row in table[1]:  # The second item in the table is the list of rows
        PrintRowAsTable(row)

def PrintRowAsTable (row=[], prefix='\t'):
    """ Print out the list in a table format. """
    print prefix + ''.join([col.ljust(15) for col in row])


# There are 4 entries for every player, one for each request made
PrintPlayerAsTable(all_players[0])
PrintPlayerAsTable(all_players[1], False)
PrintPlayerAsTable(all_players[2], False)
PrintPlayerAsTable(all_players[3], False)

print '\n\nScraped %d player Statistics' % len(all_players)
for player in all_players:
    print '\t- %s' % player[0]

# Entry 4 (the first of the second player's four entries)
print '\n\n'
print all_players[4][0]            # Player name

print '\n'
#print all_players[4][1]           # Table 1
print all_players[4][1][0]         # Table 1 Column Headers
#print all_players[4][1][1]        # Table 1 Rows
print all_players[4][1][1][1]      # Table 1 Rows -> Row 2
print all_players[4][1][1][2]      # Table 1 Rows -> Row 3
print all_players[4][1][1][-1]     # Table 1 Rows -> Last Row

print '\n'
#print all_players[4][2]           # Table 2
print all_players[4][2][0]         # Table 2 Column Headers
#print all_players[4][2][1]        # Table 2 Rows
print all_players[4][2][1][1]      # Table 2 Rows -> Row 2
print all_players[4][2][1][2]      # Table 2 Rows -> Row 3
print all_players[4][2][1][-1]     # Table 2 Rows -> Last Row

print '\nTable 3'
PrintRowAsTable(all_players[4][2][0], '')      # Table 2 Column Headers, printed in table format
PrintRowAsTable(all_players[4][2][1][1], '')   # Table 2 Rows -> Row 2
PrintRowAsTable(all_players[4][2][1][2], '')   # Table 2 Rows -> Row 3
PrintRowAsTable(all_players[4][2][1][-1], '')  # Table 2 Rows -> Last Row
Output of the scraped data, so you can see the structure of all_players:
Aaron Judge
Season vs R as R G AB PA H 1B 2B 3B HR R RBI BB IBB SO HBP SF SH GDP SB CS AVG
2016 vs R as R 27 69 77 14 8 2 0 4 8 10 6 0 32 1 1 0 2 0 0 .203
2017 vs R as R 66 198 231 65 34 10 2 19 37 42 31 3 71 2 0 0 8 3 0 .328
Total vs R as R 93 267 308 79 42 12 2 23 45 52 37 3 103 3 1 0 10 3 0 .296
Season vs R as R BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2016 vs R as R 7.8 % 41.6 % 0.19 .203 .273 .406 .679 .203 .294 7 -1.7 .291 79
2017 vs R as R 13.4 % 30.7 % 0.44 .328 .424 .687 1.111 .359 .426 54 26.1 .454 189
Total vs R as R 12.0 % 33.4 % 0.36 .296 .386 .614 1.001 .318 .394 62 24.4 .413 162
Season vs R as R GB/FB LD% GB% FB% IFFB% HR/FB IFH% BUH% Pull% Cent% Oppo% Soft% Med% Hard% Pitches Balls Strikes
2016 vs R as R 0.74 13.2 % 36.8 % 50.0 % 0.0 % 21.1 % 7.1 % 0.0 % 50.0 % 29.0 % 21.1 % 7.9 % 42.1 % 50.0 % 327 117 210
2017 vs R as R 1.14 27.6 % 38.6 % 33.9 % 2.3 % 44.2 % 6.1 % 0.0 % 45.7 % 26.8 % 27.6 % 11.0 % 39.4 % 49.6 % 985 395 590
Total vs R as R 1.02 24.2 % 38.2 % 37.6 % 1.6 % 37.1 % 6.3 % 0.0 % 46.7 % 27.3 % 26.1 % 10.3 % 40.0 % 49.7 % 1312 512 800
Season vs R as L G AB PA H 1B 2B 3B HR R RBI BB IBB SO HBP SF SH GDP SB CS AVG
2016 vs R as L 3 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 .000
2017 vs R as L 20 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 3 1 .000
Total vs R as L 23 0 0 0 0 0 0 0 15 0 0 0 0 0 0 0 0 3 2 .000
Season vs R as L BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2016 vs R as L 0.0 % 0.0 % 0.00 .000 .000 .000 .000 .000 .000 0 0.0 .000
2017 vs R as L 0.0 % 0.0 % 0.00 .000 .000 .000 .000 .000 .000 0 0.0 .000
Total vs R as L 0.0 % 0.0 % 0.00 .000 .000 .000 .000 .000 .000 0 0.0 .000
Season vs R as L GB/FB LD% GB% FB% IFFB% HR/FB IFH% BUH% Pull% Cent% Oppo% Soft% Med% Hard% Pitches Balls Strikes
2016 vs R as L 0.00 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0 0 0
2017 vs R as L 0.00 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0 0 0
Total vs R as L 0.00 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0 0 0
Season vs L as R G AB PA H 1B 2B 3B HR R RBI BB IBB SO HBP SF SH GDP SB CS AVG
2016 vs L as R 11 15 18 1 1 0 0 0 0 0 3 0 10 0 0 0 0 0 0 .067
2017 vs L as R 26 47 61 16 9 1 1 5 9 12 13 0 16 1 0 0 2 0 0 .340
Total vs L as R 37 62 79 17 10 1 1 5 9 12 16 0 26 1 0 0 2 0 0 .274
Season vs L as R BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2016 vs L as R 16.7 % 55.6 % 0.30 .067 .222 .067 .289 .000 .200 0 -2.3 .164 -8
2017 vs L as R 21.3 % 26.2 % 0.81 .340 .492 .723 1.215 .383 .423 17 9.1 .496 218
Total vs L as R 20.3 % 32.9 % 0.62 .274 .430 .565 .995 .290 .387 16 6.8 .421 166
Season vs L as R GB/FB LD% GB% FB% IFFB% HR/FB IFH% BUH% Pull% Cent% Oppo% Soft% Med% Hard% Pitches Balls Strikes
2016 vs L as R 0.33 20.0 % 20.0 % 60.0 % 0.0 % 0.0 % 0.0 % 0.0 % 20.0 % 60.0 % 20.0 % 20.0 % 40.0 % 40.0 % 81 32 49
2017 vs L as R 0.73 16.1 % 35.5 % 48.4 % 0.0 % 33.3 % 0.0 % 0.0 % 29.0 % 48.4 % 22.6 % 16.1 % 35.5 % 48.4 % 295 135 160
Total vs L as R 0.67 16.7 % 33.3 % 50.0 % 0.0 % 27.8 % 0.0 % 0.0 % 27.8 % 50.0 % 22.2 % 16.7 % 36.1 % 47.2 % 376 167 209
Season vs R as R G AB PA H 1B 2B 3B HR R RBI BB IBB SO HBP SF SH GDP SB CS AVG
2016 vs R as R 27 69 77 14 8 2 0 4 8 10 6 0 32 1 1 0 2 0 0 .203
2017 vs R as R 66 198 231 65 34 10 2 19 37 42 31 3 71 2 0 0 8 3 0 .328
Total vs R as R 93 267 308 79 42 12 2 23 45 52 37 3 103 3 1 0 10 3 0 .296
Season vs R as R BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2016 vs R as R 7.8 % 41.6 % 0.19 .203 .273 .406 .679 .203 .294 7 -1.7 .291 79
2017 vs R as R 13.4 % 30.7 % 0.44 .328 .424 .687 1.111 .359 .426 54 26.1 .454 189
Total vs R as R 12.0 % 33.4 % 0.36 .296 .386 .614 1.001 .318 .394 62 24.4 .413 162
Season vs R as R GB/FB LD% GB% FB% IFFB% HR/FB IFH% BUH% Pull% Cent% Oppo% Soft% Med% Hard% Pitches Balls Strikes
2016 vs R as R 0.74 13.2 % 36.8 % 50.0 % 0.0 % 21.1 % 7.1 % 0.0 % 50.0 % 29.0 % 21.1 % 7.9 % 42.1 % 50.0 % 327 117 210
2017 vs R as R 1.14 27.6 % 38.6 % 33.9 % 2.3 % 44.2 % 6.1 % 0.0 % 45.7 % 26.8 % 27.6 % 11.0 % 39.4 % 49.6 % 985 395 590
Total vs R as R 1.02 24.2 % 38.2 % 37.6 % 1.6 % 37.1 % 6.3 % 0.0 % 46.7 % 27.3 % 26.1 % 10.3 % 40.0 % 49.7 % 1312 512 800
Scraped 8 player Statistics
- Aaron Judge
- Aaron Judge
- Aaron Judge
- Aaron Judge
- A.J. Pollock
- A.J. Pollock
- A.J. Pollock
- A.J. Pollock
A.J. Pollock
[u'Season', u'vs R as R', u'G', u'AB', u'PA', u'H', u'1B', u'2B', u'3B', u'HR', u'R', u'RBI', u'BB', u'IBB', u'SO', u'HBP', u'SF', u'SH', u'GDP', u'SB', u'CS', u'AVG']
[u'2013', u'vs R as R', u'115', u'270', u'295', u'70', u'52', u'12', u'2', u'4', u'25', u'21', u'21', u'1', u'54', u'1', u'0', u'3', u'4', u'3', u'1', u'.259']
[u'2014', u'vs R as R', u'71', u'215', u'232', u'66', u'42', u'17', u'3', u'4', u'21', u'14', u'15', u'0', u'41', u'2', u'0', u'0', u'3', u'7', u'1', u'.307']
[u'Total', u'vs R as R', u'395', u'1120', u'1230', u'330', u'225', u'67', u'13', u'25', u'122', u'102', u'93', u'1', u'199', u'5', u'9', u'3', u'23', u'41', u'6', u'.295']
[u'Season', u'vs R as R', u'BB%', u'K%', u'BB/K', u'AVG', u'OBP', u'SLG', u'OPS', u'ISO', u'BABIP', u'wRC', u'wRAA', u'wOBA', u'wRC+']
[u'2013', u'vs R as R', u'7.1 %', u'18.3 %', u'0.39', u'.259', u'.315', u'.363', u'.678', u'.104', u'.311', u'29', u'-3.0', u'.301', u'84']
[u'2014', u'vs R as R', u'6.5 %', u'17.7 %', u'0.37', u'.307', u'.358', u'.470', u'.828', u'.163', u'.365', u'35', u'9.6', u'.364', u'128']
[u'Total', u'vs R as R', u'7.6 %', u'16.2 %', u'0.47', u'.295', u'.349', u'.445', u'.793', u'.150', u'.337', u'168', u'30.7', u'.345', u'113']
Table 3
Season vs R as R BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2013 vs R as R 7.1 % 18.3 % 0.39 .259 .315 .363 .678 .104 .311 29 -3.0 .301 84
2014 vs R as R 6.5 % 17.7 % 0.37 .307 .358 .470 .828 .163 .365 35 9.6 .364 128
Total vs R as R 7.6 % 16.2 % 0.47 .295 .349 .445 .793 .150 .337 168 30.7 .345 113
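Regarding the suggestion above to compress the lists so each player only has one entry, here is a minimal sketch (an assumption about how you might want to merge the entries, not part of the scraper above) that groups entries by player name and concatenates their tables:

# Sketch: merge the 4 per-request entries of each player into a single entry.
# Each entry looks like ['name', [cols, rows], [cols, rows], [cols, rows]].
def compress_players (all_players):
    """ Return a list with one entry per player: ['name', table, table, ...] """
    merged = []
    index_by_name = {}  # player name -> position in `merged`
    for player in all_players:
        name = player[0]
        if name not in index_by_name:
            index_by_name[name] = len(merged)
            merged.append([name])  # Start a new entry for this player
        merged[index_by_name[name]].extend(player[1:])  # Append this request's tables
    return merged

# Example usage:
# compressed = compress_players(all_players)
# len(compressed) == 2           (one entry per distinct player name)
# len(compressed[0]) == 1 + 12   (the name plus 3 tables from each of the 4 requests)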