Looping through table rows with BeautifulSoup

Time: 2017-06-21 21:55:53

Tags: python loops for-loop beautifulsoup tablerow

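I need help looping through table rows and putting them into lists. On this website there are three tables, each with different statistics: http://www.fangraphs.com/statsplits.aspx?playerid=15640&position=OF&season=0&split=0.4

For example, each of the three tables contains a row for 2016, a row for 2017, and a total row. I would like the following:

First list -> Table 1 - Row 1, Table 2 - Row 1, Table 3 - Row 1
Second list -> Table 1 - Row 2, Table 2 - Row 2, Table 3 - Row 2
Third list -> Table 1 - Row 3, Table 2 - Row 3, Table 3 - Row 3

I know I need to create lists and use the append function. However, I don't know how to make the loop go through the first row of every table, then the second row of every table, and so on for every row (the number of rows varies from case to case; this one happens to have 3).

Any help is greatly appreciated. The code is below: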

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

idList2 = ['15640', '9256']
splitList=[0.4,0.2,0.3,0.4]
for id in idList2:
    pos = 'OF'
    for split in splitList:
        url = ('http://www.fangraphs.com/statsplits.aspx?playerid=' +
            str(id) + '&position=' + str(pos) + '&season=0&split=' +
            str(split) + '')
        r = requests.get(url)

        for season in range(1,4):
            print(season)
            soup = BeautifulSoup(r.text, "html.parser")
            tableStats = soup.find("table", {"id" :  "SeasonSplits1_dgSeason" + str(season) + "_ctl00"})
            soup.findAll('th')
            column_headers = [th.getText() for th in soup.findAll('th')]                      
            statistics = soup.find("table", {"id" :
                "SeasonSplits1_dgSeason" + str(season) + "_ctl00"})
            tabledata = [td.getText() for td in statistics('td')]                         
            print(tabledata)

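2 Answers:

Answer 0 (score: 1)

This will be my last attempt. It has everything you should need. I have laid out a trail that leads back to the places where the tables, rows, and columns are scraped. All of that happens in the function extract_table(). Follow the trace markers and don't worry about any of the other code. Don't let the size of the file worry you; it is mostly documentation and spacing.

Trace markers: ### ... ###

Start at line 95, at the trace marker ### START HERE ###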

from bs4 import BeautifulSoup as Soup
import requests
import urllib


###### GO TO LINE 95 ######


### IGNORE ###
def generate_urls (idList, splitList):
    """ Using an id list and a split list, generate a list of urls """
    urls = []
    url = 'http://www.fangraphs.com/statsplits.aspx'

    for id in idList:
        for split in splitList:
            # The parameters used in creating the url
            url_payload = {'split': split, 'playerid': id, 'position': 'OF', 'season': 0}
            # Create the url and add it to the collection of urls (`urllib.urlencode` is Python 2 only)
            urls += ['?'.join([url, urllib.urlencode(url_payload)])]
    return urls # Return the list of urls




### IGNORE ###
def extract_player_name (soup):
    """ Extract the player name from the browser title """
    # Browser title contains player name, strip all but name
    player_name = repr(soup.title.text.strip('\r\n\t')) 
    player_name = player_name.split(' \\xbb')[0] # Split on ` »`
    player_name = player_name[2:] # Erase the leading `u'` characters added by `repr`
    return player_name



########## FINISH HERE ##########
def extract_table (table_id, soup):
    """ Extract data from a table, return the column headers and the table rows"""

    ### IMPORTANT: THIS CODE IS WHERE ALL THE MAGIC HAPPENS ### 
    # - First: Find lowest level tag of all the data we want (container).
    #
    # - Second: Extract the table column headers, requires minimal mining
    #
    # - Third: Gather a list of tags that represent the tables rows
    #
    # - Fourth: Loop through the list of rows 
    #      A): Mine all columns in the row

    ### IMPORTANT: Get A Reference To The Table ###
    # SCRAPE 1:
    table_tag = soup.find("table", {"id" : 'SeasonSplits1_dgSeason%d_ctl00' % table_id})            

    # SCRAPE 2: 
    columns = [th.text for th in table_tag.findAll('th')]

    # SCRAPE 3: 
    rows_tags = table_tag.tbody.findAll('tr'); # All 'tr' tags in the table `tbody` tag are row tags

    ### IMPORTANT: Cycle Through Rows And Collect Column Data ###
    # SCRAPE 4:
    rows = [] # List of all table rows
    for row_tag in rows_tags:

        ### IMPORTANT: Mine All Columns In This Row || LOWEST LEVEL IN THE MINING OPERATION. ###
        # SCRAPE 4.A
        row = [col.text for col in row_tag.findAll('td')] # `td` represents a column in a row.

        rows.append (row) # Add this row to all the other rows of this table  

    # RETURN: The column header and the rows of this table
    return [columns, rows]



### Look Deeper ###
def extract_player (soup):
    """ Extract player data and store in a list. ['name', [columns, rows], [table2]]"""
    player = [] # A list to store data in

    # player name is first in player list
    player.append (extract_player_name (soup))

    # Each table is a list entry
    for season in range(1,4): 
        ### IMPORTANT: No Table Related Data Has Been Mined Yet. START HERE ###
        ###     - Line: 37
        table = extract_table (season, soup) # `season` represents the table id 
        player.append(table) # Add this table (a list) to the player data list

    # Return the player list    
    return player


##################################################
################## START HERE ####################
##################################################
###
### OBJECTIVE: 
###
### - Follow the trail of important lines that extract the data
###     - Important lines will be marked as the following `### ... ###`
### 
### All this code really needs is a url and the `extract_table()` function.
###
### The `main()` function is where the journey starts
###
##################################################
##################################################



def main ():
    """ The main function is the core program code. """

    # Luckily the pages we will scrape all have the same layout making mining easier.    

    all_players = [] # A place to store all the data

    # Values used to alter the url when making requests to access more player statistics
    idList2 = ['15640', '9256']
    splitList=[0.4,0.2,0.3,0.4]

    # Instead of looping through variables that don't tell a story,
    # let's create a list of urls generated from those variables.
    # This way the code is self-explanatory and human-readable.
    urls = generate_urls(idList2, splitList) # The creation of the url is not important right now

    # Lets scrape each url
    for url in urls:
        print url

        # First Step: get a web page via http request.
        response = requests.get (url)

        # Second step: use a parsing library to create a parsable object 
        soup = Soup(response.text, "html.parser") # Create a soup object (Once)

        ### IMPORTANT: Parsing Starts and Ends Here ###
        ###     - Line: 75
        # Final Step: Given a soup object, mine player data
        player = extract_player (soup)

        # Add the new entry to the list
        all_players += [player]

    return all_players





# If this script is being run, not imported, run the `main()` function.
if __name__ == '__main__':
    all_players = main ()

    print all_players[0][0] # Player List -> Name
    print all_players[0][1] # Player List -> Table 1
    print all_players[0][2] # Player List -> Table 2
    print all_players[0][3] # Player List -> Table 3

    print all_players[0][3][0]       # Player List -> Table 3 -> Columns
    print all_players[0][3][1]       # Player List -> Table 3 -> All Rows
    print all_players[0][3][1][0]    # Player List -> Table 3 -> All Rows -> Row 1
    print all_players[0][3][1][2]    # Player List -> Table 3 -> All Rows -> Row 3
    print all_players[0][3][1][2][0] # Player List -> Table 3 -> All Rows -> Row 3 -> Column 1
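
The question asks for one list per row position (row 1 of every table, then row 2 of every table, and so on), while the code above returns the rows grouped by table. A minimal sketch of regrouping them with `zip`, assuming a player entry shaped like the one `extract_player()` returns. The sample values are copied from a few columns of the Aaron Judge output, and the helper name `rows_by_position` is hypothetical:

```python
# Hypothetical sample shaped like one entry of all_players:
# [name, [columns, rows], [columns, rows], [columns, rows]]
player = [
    "Aaron Judge",
    [["Season", "G"], [["2016", "27"], ["2017", "66"], ["Total", "93"]]],
    [["Season", "BB%"], [["2016", "7.8 %"], ["2017", "13.4 %"], ["Total", "12.0 %"]]],
    [["Season", "GB/FB"], [["2016", "0.74"], ["2017", "1.14"], ["Total", "1.02"]]],
]


def rows_by_position(player):
    """Return one list per row position, holding that row from every table."""
    tables = player[1:]  # skip the player name
    all_rows = [rows for _columns, rows in tables]
    # zip(*...) pairs up row 1 of every table, then row 2, and so on.
    return [list(group) for group in zip(*all_rows)]


grouped = rows_by_position(player)
for position, group in enumerate(grouped, start=1):
    print(position, group)
```

Note that `zip` stops at the shortest table, so this only covers every row when all three tables have the same number of rows, as they do in the output shown on this page.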

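Answer 1 (score: 0)

I have updated the code with separated functionality (as requested) and lists instead of dictionaries. Lines 85+ are output testing (can be ignored).

I see now that you make multiple requests (4) for the same player in order to collect more data. In the last answer I provided, the code only kept the last request made. Using lists eliminates that problem.

You may want to collapse the list so that each player has only one entry.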
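
One way to do that collapsing, as a rough sketch rather than part of the script below: merge every entry that shares a player name into a single `[name, table, table, ...]` list. The function name `collapse_players` and the shortened sample entries are hypothetical:

```python
from collections import OrderedDict


def collapse_players(all_players):
    """Merge entries that share a player name into [name, table, table, ...]."""
    merged = OrderedDict()  # keeps players in first-seen order
    for entry in all_players:
        name, tables = entry[0], entry[1:]
        merged.setdefault(name, []).extend(tables)
    return [[name] + tables for name, tables in merged.items()]


# Hypothetical, shortened entries: one table per request instead of three.
sample = [
    ["Aaron Judge", [["Season", "vs R as R"], [["2016", "vs R as R"]]]],
    ["Aaron Judge", [["Season", "vs R as L"], [["2016", "vs R as L"]]]],
    ["A.J. Pollock", [["Season", "vs R as R"], [["2013", "vs R as R"]]]],
]
collapsed = collapse_players(sample)
print(len(collapsed))  # one entry per player
```

Because `splitList` contains `0.4` twice, the same tables are scraped twice per player; removing the repeated value from `splitList` before scraping would avoid carrying duplicates into the merged entry.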

  • The core of the program is on lines 65-77.
  • Everything above all_players on line 57 is functions that handle the scraping.

Update: scrape_players.py

from bs4 import BeautifulSoup as Soup
import requests


def http_get (id, split):
    """ Make a get request, return the response. """
    # Create url parameters dictionary
    payload = {'split': split, 'playerid': id, 'position': 'OF', 'season': 0}
    url = 'http://www.fangraphs.com/statsplits.aspx'
    return requests.get(url, params=payload) # Pass payload through `requests.get()`


def extract_player_name (soup):
    """ Extract the player name from the browser title """
    # Browser title contains player name, strip all but name
    player_name = repr(soup.title.text.strip('\r\n\t')) 
    player_name = player_name.split(' \\xbb')[0] # Split on ` »`
    player_name = player_name[2:] # Erase the leading `u'` characters added by `repr`
    return player_name


def extract_table (table_id, soup):
    """ Extract data from a table, return the column headers and the table rows"""
    # SCRAPE: Get a table
    table_tag = soup.find("table", {"id" : 'SeasonSplits1_dgSeason%d_ctl00' % table_id})            

    # SCRAPE: Extract table column headers
    columns = [th.text for th in table_tag.findAll('th')]

    rows = [] 
    # SCRAPE: Extract Table Contents
    for row in table_tag.tbody.findAll('tr'):
        rows.append ([col.text for col in row.findAll('td')])  # Gather all columns in the row

    # RETURN: [columns, rows]
    return [columns, rows]


def extract_player (soup):
    """ Extract player data and store in a list. ['name', [columns, rows], [table2]]"""
    player = []

    # player name is first in player list
    player.append (extract_player_name (soup))

    # Each table is a list entry
    for season in range(1,4): 
        player.append(extract_table (season, soup))
    # Return the player list    
    return player





# A list of all players
all_players = [
    #'playername', 
    #[table_columns, table_rows],
    #[table_columns, table_rows],
    #[['Season', 'vs R as R'], [['2015', 'yes'], ['2016', 'no'], ['2017', 'no'],]],
]

# I don't know what these values are. Sorry!
idList2 = ['15640', '9256']
splitList=[0.4,0.2,0.3,0.4]


# Scrape data
for id in idList2:    
    for split in splitList:
        response = http_get (id, split)

        soup = Soup(response.text, "html.parser") # Create a soup object (Once)

        all_players.append (extract_player (soup))
        # or all_players += [extract_player (soup)]






# Output data
def PrintPlayerAsTable (player, show_name=True):
    if show_name: print player[0] # First entry is the player name
    for table in player[1:]: # All other entries are tables
        PrintTableAsTable(table)

def PrintTableAsTable (table, table_sep='\n'):
    print table_sep
    PrintRowAsTable(table[0]) # The first row in the table is the columns
    for row in table[1]: # The second item in the table is a list of rows
        PrintRowAsTable (row)

def PrintRowAsTable (row=[], prefix='\t'):
    """ Print out the list in a table format. """
    print prefix + ''.join([col.ljust(15) for col in row])



# There are 4 entries to every player, one for each request made
PrintPlayerAsTable (all_players[0])
PrintPlayerAsTable (all_players[1], False)
PrintPlayerAsTable (all_players[2], False)
PrintPlayerAsTable (all_players[3], False)


print '\n\nScraped %d player Statistics' % len(all_players) 
for player in all_players:
    print '\t- %s' % player[0]


# 5th player entry (index 4), the first A.J. Pollock request
print '\n\n'
print all_players[4][0] # Player name

print '\n'
#print all_players[4][1]        # Table 1
print all_players[4][1][0]     # Table 1 Column Headers
#print all_players[4][1][1]     # Table 1 Rows
print all_players[4][1][1][1]  # Table 1 Rows Row 2 (index 1)
print all_players[4][1][1][2]  # Table 1 Rows Row 3 (index 2)
print all_players[4][1][1][-1] # Table 1 Rows Last Row 

print '\n'
#print all_players[4][2]        # Table 2
print all_players[4][2][0]     # Table 2 Column Headers
#print all_players[4][2][1]     # Table 2 Rows
print all_players[4][2][1][1]  # Table 2 Rows Row 2 (index 1)
print all_players[4][2][1][2]  # Table 2 Rows Row 3 (index 2)
print all_players[4][2][1][-1] # Table 2 Rows Last Row 

print '\nTable 3'
PrintRowAsTable(all_players[4][2][0], '')     # NOTE: index 2 is Table 2 (use index 3 for Table 3): Column Headers
PrintRowAsTable(all_players[4][2][1][1], '')  # Table 2 Rows Row 2 (index 1)
PrintRowAsTable(all_players[4][2][1][2], '')  # Table 2 Rows Row 3 (index 2)
PrintRowAsTable(all_players[4][2][1][-1], '') # Table 2 Rows Last Row   

Output:

The scraped data is printed out to show the structure of all_players.

Aaron Judge


    Season         vs R as R      G              AB             PA             H              1B             2B             3B             HR             R              RBI            BB             IBB            SO             HBP            SF             SH             GDP            SB             CS             AVG            
    2016           vs R as R      27             69             77             14             8              2              0              4              8              10             6              0              32             1              1              0              2              0              0              .203           
    2017           vs R as R      66             198            231            65             34             10             2              19             37             42             31             3              71             2              0              0              8              3              0              .328           
    Total          vs R as R      93             267            308            79             42             12             2              23             45             52             37             3              103            3              1              0              10             3              0              .296           


    Season         vs R as R      BB%            K%             BB/K           AVG            OBP            SLG            OPS            ISO            BABIP          wRC            wRAA           wOBA           wRC+           
    2016           vs R as R      7.8 %          41.6 %         0.19           .203           .273           .406           .679           .203           .294           7              -1.7           .291           79             
    2017           vs R as R      13.4 %         30.7 %         0.44           .328           .424           .687           1.111          .359           .426           54             26.1           .454           189            
    Total          vs R as R      12.0 %         33.4 %         0.36           .296           .386           .614           1.001          .318           .394           62             24.4           .413           162            


    Season         vs R as R      GB/FB          LD%            GB%            FB%            IFFB%          HR/FB          IFH%           BUH%           Pull%          Cent%          Oppo%          Soft%          Med%           Hard%          Pitches        Balls          Strikes        
    2016           vs R as R      0.74           13.2 %         36.8 %         50.0 %         0.0 %          21.1 %         7.1 %          0.0 %          50.0 %         29.0 %         21.1 %         7.9 %          42.1 %         50.0 %         327            117            210            
    2017           vs R as R      1.14           27.6 %         38.6 %         33.9 %         2.3 %          44.2 %         6.1 %          0.0 %          45.7 %         26.8 %         27.6 %         11.0 %         39.4 %         49.6 %         985            395            590            
    Total          vs R as R      1.02           24.2 %         38.2 %         37.6 %         1.6 %          37.1 %         6.3 %          0.0 %          46.7 %         27.3 %         26.1 %         10.3 %         40.0 %         49.7 %         1312           512            800            


    Season         vs R as L      G              AB             PA             H              1B             2B             3B             HR             R              RBI            BB             IBB            SO             HBP            SF             SH             GDP            SB             CS             AVG            
    2016           vs R as L      3              0              0              0              0              0              0              0              2              0              0              0              0              0              0              0              0              0              1              .000           
    2017           vs R as L      20             0              0              0              0              0              0              0              13             0              0              0              0              0              0              0              0              3              1              .000           
    Total          vs R as L      23             0              0              0              0              0              0              0              15             0              0              0              0              0              0              0              0              3              2              .000           


    Season         vs R as L      BB%            K%             BB/K           AVG            OBP            SLG            OPS            ISO            BABIP          wRC            wRAA           wOBA           wRC+           
    2016           vs R as L      0.0 %          0.0 %          0.00           .000           .000           .000           .000           .000           .000           0              0.0            .000                          
    2017           vs R as L      0.0 %          0.0 %          0.00           .000           .000           .000           .000           .000           .000           0              0.0            .000                          
    Total          vs R as L      0.0 %          0.0 %          0.00           .000           .000           .000           .000           .000           .000           0              0.0            .000                          


    Season         vs R as L      GB/FB          LD%            GB%            FB%            IFFB%          HR/FB          IFH%           BUH%           Pull%          Cent%          Oppo%          Soft%          Med%           Hard%          Pitches        Balls          Strikes        
    2016           vs R as L      0.00           0.0 %          0.0 %          0.0 %          0.0 %          0.0 %          0.0 %          0.0 %                                                                                                    0              0              0              
    2017           vs R as L      0.00           0.0 %          0.0 %          0.0 %          0.0 %          0.0 %          0.0 %          0.0 %                                                                                                    0              0              0              
    Total          vs R as L      0.00           0.0 %          0.0 %          0.0 %          0.0 %          0.0 %          0.0 %          0.0 %                                                                                                    0              0              0              


    Season         vs L as R      G              AB             PA             H              1B             2B             3B             HR             R              RBI            BB             IBB            SO             HBP            SF             SH             GDP            SB             CS             AVG            
    2016           vs L as R      11             15             18             1              1              0              0              0              0              0              3              0              10             0              0              0              0              0              0              .067           
    2017           vs L as R      26             47             61             16             9              1              1              5              9              12             13             0              16             1              0              0              2              0              0              .340           
    Total          vs L as R      37             62             79             17             10             1              1              5              9              12             16             0              26             1              0              0              2              0              0              .274           


    Season         vs L as R      BB%            K%             BB/K           AVG            OBP            SLG            OPS            ISO            BABIP          wRC            wRAA           wOBA           wRC+           
    2016           vs L as R      16.7 %         55.6 %         0.30           .067           .222           .067           .289           .000           .200           0              -2.3           .164           -8             
    2017           vs L as R      21.3 %         26.2 %         0.81           .340           .492           .723           1.215          .383           .423           17             9.1            .496           218            
    Total          vs L as R      20.3 %         32.9 %         0.62           .274           .430           .565           .995           .290           .387           16             6.8            .421           166            


    Season         vs L as R      GB/FB          LD%            GB%            FB%            IFFB%          HR/FB          IFH%           BUH%           Pull%          Cent%          Oppo%          Soft%          Med%           Hard%          Pitches        Balls          Strikes        
    2016           vs L as R      0.33           20.0 %         20.0 %         60.0 %         0.0 %          0.0 %          0.0 %          0.0 %          20.0 %         60.0 %         20.0 %         20.0 %         40.0 %         40.0 %         81             32             49             
    2017           vs L as R      0.73           16.1 %         35.5 %         48.4 %         0.0 %          33.3 %         0.0 %          0.0 %          29.0 %         48.4 %         22.6 %         16.1 %         35.5 %         48.4 %         295            135            160            
    Total          vs L as R      0.67           16.7 %         33.3 %         50.0 %         0.0 %          27.8 %         0.0 %          0.0 %          27.8 %         50.0 %         22.2 %         16.7 %         36.1 %         47.2 %         376            167            209            


    Season         vs R as R      G              AB             PA             H              1B             2B             3B             HR             R              RBI            BB             IBB            SO             HBP            SF             SH             GDP            SB             CS             AVG            
    2016           vs R as R      27             69             77             14             8              2              0              4              8              10             6              0              32             1              1              0              2              0              0              .203           
    2017           vs R as R      66             198            231            65             34             10             2              19             37             42             31             3              71             2              0              0              8              3              0              .328           
    Total          vs R as R      93             267            308            79             42             12             2              23             45             52             37             3              103            3              1              0              10             3              0              .296           


    Season         vs R as R      BB%            K%             BB/K           AVG            OBP            SLG            OPS            ISO            BABIP          wRC            wRAA           wOBA           wRC+           
    2016           vs R as R      7.8 %          41.6 %         0.19           .203           .273           .406           .679           .203           .294           7              -1.7           .291           79             
    2017           vs R as R      13.4 %         30.7 %         0.44           .328           .424           .687           1.111          .359           .426           54             26.1           .454           189            
    Total          vs R as R      12.0 %         33.4 %         0.36           .296           .386           .614           1.001          .318           .394           62             24.4           .413           162            


    Season         vs R as R      GB/FB          LD%            GB%            FB%            IFFB%          HR/FB          IFH%           BUH%           Pull%          Cent%          Oppo%          Soft%          Med%           Hard%          Pitches        Balls          Strikes        
    2016           vs R as R      0.74           13.2 %         36.8 %         50.0 %         0.0 %          21.1 %         7.1 %          0.0 %          50.0 %         29.0 %         21.1 %         7.9 %          42.1 %         50.0 %         327            117            210            
    2017           vs R as R      1.14           27.6 %         38.6 %         33.9 %         2.3 %          44.2 %         6.1 %          0.0 %          45.7 %         26.8 %         27.6 %         11.0 %         39.4 %         49.6 %         985            395            590            
    Total          vs R as R      1.02           24.2 %         38.2 %         37.6 %         1.6 %          37.1 %         6.3 %          0.0 %          46.7 %         27.3 %         26.1 %         10.3 %         40.0 %         49.7 %         1312           512            800            


Scraped 8 player Statistics
    - Aaron Judge
    - Aaron Judge
    - Aaron Judge
    - Aaron Judge
    - A.J. Pollock
    - A.J. Pollock
    - A.J. Pollock
    - A.J. Pollock



A.J. Pollock


[u'Season', u'vs R as R', u'G', u'AB', u'PA', u'H', u'1B', u'2B', u'3B', u'HR', u'R', u'RBI', u'BB', u'IBB', u'SO', u'HBP', u'SF', u'SH', u'GDP', u'SB', u'CS', u'AVG']
[u'2013', u'vs R as R', u'115', u'270', u'295', u'70', u'52', u'12', u'2', u'4', u'25', u'21', u'21', u'1', u'54', u'1', u'0', u'3', u'4', u'3', u'1', u'.259']
[u'2014', u'vs R as R', u'71', u'215', u'232', u'66', u'42', u'17', u'3', u'4', u'21', u'14', u'15', u'0', u'41', u'2', u'0', u'0', u'3', u'7', u'1', u'.307']
[u'Total', u'vs R as R', u'395', u'1120', u'1230', u'330', u'225', u'67', u'13', u'25', u'122', u'102', u'93', u'1', u'199', u'5', u'9', u'3', u'23', u'41', u'6', u'.295']


[u'Season', u'vs R as R', u'BB%', u'K%', u'BB/K', u'AVG', u'OBP', u'SLG', u'OPS', u'ISO', u'BABIP', u'wRC', u'wRAA', u'wOBA', u'wRC+']
[u'2013', u'vs R as R', u'7.1 %', u'18.3 %', u'0.39', u'.259', u'.315', u'.363', u'.678', u'.104', u'.311', u'29', u'-3.0', u'.301', u'84']
[u'2014', u'vs R as R', u'6.5 %', u'17.7 %', u'0.37', u'.307', u'.358', u'.470', u'.828', u'.163', u'.365', u'35', u'9.6', u'.364', u'128']
[u'Total', u'vs R as R', u'7.6 %', u'16.2 %', u'0.47', u'.295', u'.349', u'.445', u'.793', u'.150', u'.337', u'168', u'30.7', u'.345', u'113']

Table 3
Season         vs R as R      BB%            K%             BB/K           AVG            OBP            SLG            OPS            ISO            BABIP          wRC            wRAA           wOBA           wRC+           
2013           vs R as R      7.1 %          18.3 %         0.39           .259           .315           .363           .678           .104           .311           29             -3.0           .301           84             
2014           vs R as R      6.5 %          17.7 %         0.37           .307           .358           .470           .828           .163           .365           35             9.6            .364           128            
Total          vs R as R      7.6 %          16.2 %         0.47           .295           .349           .445           .793           .150           .337           168            30.7           .345           113
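
Both answers are written for Python 2: they use `print` statements, `urllib.urlencode`, and a `repr`-based trick to cut the player name out of the page title. For readers on Python 3, here is a minimal sketch of the two helpers that change most. It assumes the title has the form `<player name> » ...`, as the comments in the answers suggest. The function names and the sample title string are hypothetical:

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.fangraphs.com/statsplits.aspx'


def build_url(player_id, split, position='OF', season=0):
    """Build a stat-splits URL with the same parameters the question uses."""
    params = [
        ('playerid', player_id),
        ('position', position),
        ('season', season),
        ('split', split),
    ]
    return BASE_URL + '?' + urlencode(params)


def player_name_from_title(title_text):
    """Keep only the text before the first ' \u00bb' separator in the title."""
    # '\u00bb' is the right-pointing double angle quotation mark.
    return title_text.strip().split(' \u00bb')[0]


url = build_url('15640', 0.4)
name = player_name_from_title('\r\n\tAaron Judge \u00bb Splits\r\n')  # hypothetical title
print(url)
print(name)
```

With `requests`, passing the parameters through `requests.get(url, params=payload)` as Answer 1 does makes the URL-building step unnecessary.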