BeautifulSoup: trouble accessing the correct table

Asked: 2015-06-02 01:35:48

Tags: python beautifulsoup

I'm using BeautifulSoup4 to scrape a page, and the following function is giving me 2 problems:

import urllib.request
from bs4 import BeautifulSoup

def getTeamRoster(teamURL):
    html = urllib.request.urlopen(teamURL).read()
    soup = BeautifulSoup(html)
    teamPlayers = []
    #second table
    corebody = soup.find(id = "corebody")
    teamTable = corebody.table.next_sibling.next_sibling.next_sibling.next_sibling
    print(teamTable)
    tableBody = teamTable.find('tbody')
    print(tableBody)
    tableRows = tableBody.findAll('tr')

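1) When I call ".next_sibling" only 4 times (as above), I seem to get the correct table. However, the table tag I'm trying to access is the 6th table inside the #corebody element. When I call ".next_sibling" 5 times, I get -1 from BeautifulSoup, which seems to indicate that the table I requested doesn't exist? I thought you would normally get None when that happens. Any idea why 5 calls to ".next_sibling" don't work as expected?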

The URL is http://modules.ussquash.com/ssm/pages/leagues/Team_Information.asp?id=11325

2) tableBody = teamTable.find('tbody') is giving me some trouble. When I print tableBody, I get None, but I'm not sure why (the table I'm accessing definitely has a <tbody> tag).

Thoughts?

Thanks for the help, bclayman

2 Answers:

Answer 0: (score: 2)

I was able to get the players table using pandas.read_html:

import requests
import pandas as pd

url = 'http://modules.ussquash.com/ssm/pages/leagues/Team_Information.asp?id=11325'
tables = pd.read_html(requests.get(url).content)
tables[4]
                            \n\t\t\t\tPlayers\n\t\t\t           City Gender  SinglesRating TeamPosition  Expiration Win/Loss    P Registered Code Ref. Exam
0                                         Browne,Noah        Taunton      M           5.56            1  02/29/2016   14 / 4    -   08/28/14    -       NaN
1                                      Ellis,Thornton            rye      M           4.27           10  02/29/2016    0 / 9    -   08/28/14    -      pass
2                                          Line,James    Glastonbury      M           4.25           10  02/29/2016    2 / 7    -   08/28/14    -       NaN
3                                   Desantis,Scott J.        Sudbury      M           5.08            2  02/29/2016   9 / 10    -   08/28/14    -      pass
4                                    Bahadori,Cameron    Great Falls      M           4.97            3  01/12/2016   3 / 10    -   11/05/14    -      pass
5                                       Groot,Michael       Victoria      M           4.76            4  02/29/2016   5 / 11    -   08/28/14    -       NaN
6                                       Ehsani,Darian      Greenwich      M           4.76            5  02/29/2016   6 / 13    -   08/28/14    -      pass
7                                          Kardon,Max         Weston      M           4.83            6  02/29/2016   5 / 14    -   08/28/14    -      pass
8                                          Van,Jeremy            NaN      M           4.66            7  02/29/2016   5 / 13    -   08/28/14    -       NaN
9                              Southmayd,Alexander T.         Boston      M           4.91            8  02/29/2016   13 / 6    -   08/28/14    -      pass
10                                 Cacouris,Stephen A         Alpine      M           4.68            9  02/29/2016   9 / 10    -   08/28/14    -      pass
11                                  Groot,Christopher       Edmonton      M           4.62            -  02/29/2016    0 / 2    -   08/28/14    -       NaN
12                                Mack,Peter D. (sub)     N. Eastham      M           3.94            -  02/29/2016    0 / 1    -   11/23/14    -       NaN
13                               Shrager,Nathaniel O.       Stanford      M           0.00            -  02/29/2016    0 / 0    -   08/28/14    -       NaN
14                                Woolverton,Peter C.  Chestnut Hill      M           4.06            -  02/29/2016    1 / 0    -   08/28/14    -       NaN
15  Total Players: 15 Average singles rating: 4.36...            NaN    NaN            NaN          NaN         NaN      NaN  NaN        NaN  NaN       NaN
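
Picking the table by its index (tables[4]) depends on the page layout staying the same. As a hedged alternative, pandas.read_html accepts a match argument that keeps only tables whose text matches a string or regex. The sketch below runs on a small made-up HTML snippet, not the real page, so the markup and the "Players" match text are assumptions for illustration:

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded page; the real markup may differ.
html = """
<div id="corebody">
  <table>
    <tr><th>Team</th><th>Division</th></tr>
    <tr><td>Example</td><td>A</td></tr>
  </table>
  <table>
    <tr><th>Players</th><th>City</th></tr>
    <tr><td>Browne,Noah</td><td>Taunton</td></tr>
    <tr><td>Line,James</td><td>Glastonbury</td></tr>
  </table>
</div>
"""

# match keeps only the tables whose text matches the given string or regex,
# so the result does not depend on how many tables come before the roster.
tables = pd.read_html(io.StringIO(html), match="Players")
players = tables[0]
print(players)
```

With the real page you would pass the downloaded content instead of the inline string; whether "Players" is specific enough there is something to check against the actual HTML.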

Answer 1: (score: 1)

Use soup.select

One-liner:

[i.get_text() for i in soup.select('#corebody table tr td') if 'Won' in i.get_text() or 'Lost' in i.get_text()]

Long version:

for i in soup.select('#corebody table tr td'):
    if 'Won' in i.get_text() or 'Lost' in i.get_text():
        print(i.get_text())

[u'Won 7-2', u'Won 5-4', u'Lost 1-8', u'Lost 1-8', u'Won 8-1', u'Lost 3-6', u'Won 7-2', u'Lost 0-9', u'Lost 1-8', u'Won 5-4', u'Lost 1-8', u'Lost 2-7', u'Won 8-1', u'Lost 3-6', u'Lost 4-5', u'Lost 4-5', u'Lost 1-8', u'Lost 4-5', u'Won 6-3']
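
Regarding the two symptoms in the question, here is a minimal sketch of what is probably going on. It uses made-up markup rather than the real page, so the ids and structure are assumptions. The idea: .next_sibling returns the very next node, which is often a whitespace text node between tags, and calling .find() on such a text node is plain str.find(), which returns -1 when nothing is found. Also, html.parser does not add a tbody element that is missing from the raw HTML, even though browser developer tools usually display one.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up markup standing in for the real page (an assumption).
html = """<div id="corebody">
<table id="t1"><tr><td>first</td></tr></table>
<table id="t2"><tr><td>second</td></tr></table>
</div>"""

soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="corebody").table

# The next sibling of the first table is the newline between the two tags,
# not the second table.
node = first.next_sibling
print(type(node).__name__, repr(node))

# NavigableString subclasses str, so this call is str.find(), not a tag search.
print(node.find("tbody"))

# find_next_sibling("table") skips text nodes and returns the next <table> tag.
second = first.find_next_sibling("table")
print(second["id"])

# The raw HTML has no <tbody>, and html.parser does not insert one.
print(second.find("tbody"))

# Searching for rows directly on the table works either way.
rows = second.find_all("tr")
print(len(rows))
```

If this matches what happens on the real page, two options are to replace the chained .next_sibling calls with find_next_sibling("table") or soup.select("#corebody table")[n], and to call find_all('tr') on the table itself instead of going through tbody.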