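Hello, I am new to parsing HTML tables with Python and beautifulsoup4. Everything was going well until I ran into this odd table that uses a 'th' tag partway through the table, which makes my parsing stop and throw an 'index out of range' error. I have tried searching SO and Google to no avail. The question is: how can I ignore or remove this rogue 'th' tag while parsing the table?
Here is the code I have so far: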
    from mechanize import Browser
    from bs4 import BeautifulSoup

    mech = Browser()
    url = 'https://www.moscone.com/site/do/event/list'
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)
    table = soup.find('table', { 'id' : 'list' })
    for row in table.findAll('tr')[3:]:
        col = row.findAll('td')
        date = col[0].string
        name = col[1].string
        location = col[2].string
        record = (name, date, location)
        final = ','.join(record)
        print(final)
Here is the small piece of the HTML that causes my error:
<td>
Convention
</td>
</tr>
<tr>
<th class="title" colspan="4">
Mon Dec 01 00:00:00 PST 2014
</th>
</tr>
<tr>
<td>
12/06/14 - 12/09/14
</td>
I do want the data above and below this rogue 'th', which marks the start of a new month in the table.
Answer 0 (score: 2)
You can check whether there is a th in the row and parse the contents only if there is not, like so:
    for row in table.findAll('tr')[3:]:
        # so make sure th is not in row
        if not row.find_all('th'):
            col = row.findAll('td')
            date = col[0].string
            name = col[1].string
            location = col[2].string
            record = (name, date, location)
            final = ','.join(record)
            print(final)
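For anyone who wants to try the idea without fetching the live page, here is a minimal, self-contained sketch. The `SAMPLE_HTML` string and the `parse_events` helper are hypothetical, made up for illustration from the fragment in the question; the real page may be laid out differently. Instead of looking for 'th', this variant skips any row with fewer than three 'td' cells, which has the same effect on the month header rows, and it uses `get_text(strip=True)` rather than `.string` so that a cell with nested markup does not come back as `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the fragment in the question.
SAMPLE_HTML = """
<table id="list">
  <tr>
    <td>11/22/14 - 11/29/14</td>
    <td>San Francisco International Auto Show</td>
    <td>Moscone North &amp; South</td>
    <td>Convention</td>
  </tr>
  <tr>
    <th class="title" colspan="4">Mon Dec 01 00:00:00 PST 2014</th>
  </tr>
  <tr>
    <td>12/06/14 - 12/09/14</td>
    <td>American Society of Hematology</td>
    <td>Moscone North, South and West</td>
    <td>Convention</td>
  </tr>
</table>
"""


def parse_events(html):
    """Return (name, date, location) tuples, skipping rows without enough td cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "list"})
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 3:
            # Month header rows hold only a th, so there is nothing to index here.
            continue
        date, name, location = (cell.get_text(strip=True) for cell in cells[:3])
        records.append((name, date, location))
    return records


records = parse_events(SAMPLE_HTML)
for record in records:
    print(",".join(record))
```

The rows before and after the 'th' are both kept; only the header row itself is dropped.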
Here is what I got from the URL you provided, without the IndexError:
Out & Equal Workplace,11/03/14 - 11/06/14,Moscone West
Samsung Developer Conference,11/11/14 - 11/13/14,Moscone West
North American Spine Society (NASS) Annual Meeting,11/12/14 - 11/15/14,Moscone South and Esplanade Ballroom
San Francisco International Auto Show,11/22/14 - 11/29/14,Moscone North & South
67th Annual Meeting of the APS Division of Fluid Dynamics,11/23/14 - 11/25/14,Moscone North, South and West
American Society of Hematology,12/06/14 - 12/09/14,Moscone North, South and West
California School Boards Association,12/12/14 - 12/16/14,Moscone North & Esplanade Ballroom
American Geophysical Union,12/15/14 - 12/19/14,Moscone North & South
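One thing to watch in the output above: some locations contain commas themselves (for example "Moscone North, South and West"), so a plain `','.join(record)` produces lines that cannot be split back into three fields reliably. If the goal is a CSV file, a possible alternative, not part of the original answer, is the standard library `csv` module, which quotes such fields. The sketch below uses two records copied from the output and writes to an in-memory buffer as a stand-in for a real file.

```python
import csv
import io

# Records copied from the output above, in (name, date, location) order.
records = [
    ("Samsung Developer Conference", "11/11/14 - 11/13/14", "Moscone West"),
    ("American Society of Hematology", "12/06/14 - 12/09/14",
     "Moscone North, South and West"),
]

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text, end="")

# Reading the text back recovers exactly three fields per record.
parsed = [tuple(row) for row in csv.reader(io.StringIO(csv_text))]
```

The location with embedded commas is written inside double quotes, so any CSV reader sees it as a single field.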