I'm using BeautifulSoup4 to scrape a page, and the function below is giving me two problems:
import urllib.request
from bs4 import BeautifulSoup

def getTeamRoster(teamURL):
    html = urllib.request.urlopen(teamURL).read()
    soup = BeautifulSoup(html)
    teamPlayers = []
    # second table
    corebody = soup.find(id="corebody")
    teamTable = corebody.table.next_sibling.next_sibling.next_sibling.next_sibling
    print(teamTable)
    tableBody = teamTable.find('tbody')
    print(tableBody)
    tableRows = tableBody.findAll('tr')
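1) When I call ".next_sibling" only 4 times (as above), I seem to get the correct table. However, the table tag I'm trying to reach is the 6th table inside the #corebody ID. When I call ".next_sibling" 5 times, I get -1 back from BeautifulSoup, which suggests the table I'm asking for doesn't exist? I thought you would normally get None when that happens. Any idea why calling ".next_sibling" 5 times doesn't work as expected?
The URL is http://modules.ussquash.com/ssm/pages/leagues/Team_Information.asp?id=11325
2) tableBody = teamTable.find('tbody') is giving me some trouble. When I print tableBody, I get None, but I'm not sure why (the table I'm accessing definitely has a tbody tag in it).
Any ideas?
Thanks for your help, bclayman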
Answer 0 (score: 2)
I was able to get the players table using pandas.read_html:
import requests
import pandas as pd
url = 'http://modules.ussquash.com/ssm/pages/leagues/Team_Information.asp?id=11325'
tables = pd.read_html(requests.get(url).content)
tables[4]
    \n\t\t\t\tPlayers\n\t\t\t  City                             Gender  SinglesRating  TeamPosition  Expiration  Win/Loss  P    Registered  Code  Ref. Exam
0   Browne,Noah                Taunton                          M       5.56           1             02/29/2016  14 / 4    -    08/28/14    -     NaN
1   Ellis,Thornton             rye                              M       4.27           10            02/29/2016  0 / 9     -    08/28/14    -     pass
2   Line,James                 Glastonbury                      M       4.25           10            02/29/2016  2 / 7     -    08/28/14    -     NaN
3   Desantis,Scott J.          Sudbury                          M       5.08           2             02/29/2016  9 / 10    -    08/28/14    -     pass
4   Bahadori,Cameron           Great Falls                      M       4.97           3             01/12/2016  3 / 10    -    11/05/14    -     pass
5   Groot,Michael              Victoria                         M       4.76           4             02/29/2016  5 / 11    -    08/28/14    -     NaN
6   Ehsani,Darian              Greenwich                        M       4.76           5             02/29/2016  6 / 13    -    08/28/14    -     pass
7   Kardon,Max                 Weston                           M       4.83           6             02/29/2016  5 / 14    -    08/28/14    -     pass
8   Van,Jeremy                 NaN                              M       4.66           7             02/29/2016  5 / 13    -    08/28/14    -     NaN
9   Southmayd,Alexander T.     Boston                           M       4.91           8             02/29/2016  13 / 6    -    08/28/14    -     pass
10  Cacouris,Stephen A         Alpine                           M       4.68           9             02/29/2016  9 / 10    -    08/28/14    -     pass
11  Groot,Christopher          Edmonton                         M       4.62           -             02/29/2016  0 / 2     -    08/28/14    -     NaN
12  Mack,Peter D. (sub)        N. Eastham                       M       3.94           -             02/29/2016  0 / 1     -    11/23/14    -     NaN
13  Shrager,Nathaniel O.       Stanford                         M       0.00           -             02/29/2016  0 / 0     -    08/28/14    -     NaN
14  Woolverton,Peter C.        Chestnut Hill                    M       4.06           -             02/29/2016  1 / 0     -    08/28/14    -     NaN
15  Total Players: 15          Average singles rating: 4.36...  NaN     NaN            NaN           NaN         NaN       NaN  NaN         NaN   NaN
Answer 1 (score: 1)
Use soup.select
One-liner:
[i.get_text() for i in soup.select('#corebody table tr td') if 'Won' in i.get_text() or 'Lost' in i.get_text()]
Long version:
for i in soup.select('#corebody table tr td'):
    if 'Won' in i.get_text() or 'Lost' in i.get_text():
        print(i.get_text())
[u'Won 7-2',
u'Won 5-4',
u'Lost 1-8',
u'Lost 1-8',
u'Won 8-1',
u'Lost 3-6',
u'Won 7-2',
u'Lost 0-9',
u'Lost 1-8',
u'Won 5-4',
u'Lost 1-8',
u'Lost 2-7',
u'Won 8-1',
u'Lost 3-6',
u'Lost 4-5',
u'Lost 4-5',
u'Lost 1-8',
u'Lost 4-5',
u'Won 6-3']
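Coming back to the two symptoms in the question, here is a minimal sketch of what is probably going on. It runs against a small made-up HTML snippet (the HTML string, the table ids, and the variable names are all assumptions for illustration, not the real page). The whitespace between two tags is itself a sibling node, so every other .next_sibling lands on a text node rather than a table. That text node is a str subclass, so calling .find('tbody') on it runs str.find, which returns -1 instead of None. As for tbody, browsers add that element when they build the DOM, so it can show up in the inspector without being in the HTML the server actually sends. That may be what is happening here, though I have not checked the real page's source.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for the real page: two tables separated by newlines,
# and no <tbody> in the markup.
HTML = """<div id="corebody">
<table id="first"><tr><td>header</td></tr></table>
<table id="roster"><tr><td>Browne,Noah</td></tr><tr><td>Ellis,Thornton</td></tr></table>
</div>"""

soup = BeautifulSoup(HTML, "html.parser")
corebody = soup.find(id="corebody")

# The node right after the first table is the newline between the two tags.
sibling = corebody.table.next_sibling
is_text_node = isinstance(sibling, NavigableString)

# NavigableString subclasses str, so this is str.find(), which gives -1 on a miss.
str_find_result = sibling.find("tbody")

# find_all("table") returns only Tag objects, so indexing skips the whitespace.
tables = corebody.find_all("table")
roster = tables[1]

# html.parser does not add a <tbody> that is missing from the source.
tbody = roster.find("tbody")

# Asking the table for its rows directly works with or without <tbody>.
players = [row.td.get_text(strip=True) for row in roster.find_all("tr")]
print(players)
```

If this is the cause, indexing into corebody.find_all("table") and calling find_all("tr") on the table itself avoids both problems.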