In Python, I have a variable containing an HTML table element, as follows:
import requests
from lxml import html

page = requests.get('http://www.myPage.com')
tree = html.fromstring(page.content)
table = tree.xpath('//table[@class="list"]')
The table variable has the following content:
<table class="list">
<tr>
<th>Date(s)</th>
<th>Sport</th>
<th>Event</th>
<th>Location</th>
</tr>
<tr>
<td>Jan 18-31</td>
<td>Tennis</td>
<td><a href="tennis-grand-slam/australian-open/index.htm">Australia Open</a></td>
<td>Melbourne, Australia</td>
</tr>
</table>
I tried to extract the headers like this:
rows = iter(table)
headers = [col.text for col in next(rows)]
print "headers are: ", headers
However, when I print the headers variable, I get this:
headers are: ['\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n
', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n
', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n
', '\n ', '\n ']
How do I extract the headers correctly?
Answer 0 (score: 0)
Try this:
from lxml import html
HTML_CODE = """<table class="list">
<tr>
<th>Date(s)</th>
<th>Sport</th>
<th>Event</th>
<th>Location</th>
</tr>
<tr>
<td>Jan 18-31</td>
<td>Tennis</td>
<td><a href="tennis-grand-slam/australian-open/index.htm">Australia Open</a></td>
<td>Melbourne, Australia</td>
</tr>
</table>"""
tree = html.fromstring(HTML_CODE)
headers = tree.xpath('//table[@class="list"]/tr/th/text()')
print("Headers are: {}".format(', '.join(headers)))
Output:
Headers are: Date(s), Sport, Event, Location
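One caveat with `th/text()`: it only returns text nodes that sit directly inside each `<th>`, so a header wrapped in a link or padded with whitespace would come back missing or untrimmed. A minimal sketch of one way around that, assuming hypothetical markup (the padded and linked headers below are not from the question) and using lxml's `text_content()`:

```python
from lxml import html

# Hypothetical markup for illustration: one header is padded with spaces,
# the other is wrapped in a link.
HTML_CODE = """<table class="list">
<tr>
<th> Date(s) </th>
<th><a href="#">Sport</a></th>
</tr>
</table>"""

tree = html.fromstring(HTML_CODE)

# text() only sees text that is a direct child of <th>,
# so the linked header is not picked up here.
direct_text = tree.xpath('//table[@class="list"]/tr/th/text()')

# text_content() collects the text of the cell and all its descendants.
cells = tree.xpath('//table[@class="list"]/tr/th')
headers = [th.text_content().strip() for th in cells]

print("Direct text nodes: {!r}".format(direct_text))
print("Headers are: {}".format(', '.join(headers)))
```

For the plain headers in the question, both approaches should give the same four strings, so this only matters if the real page has nested markup in its header cells.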
Answer 1 (score: 0)
Using the table you already have, and assuming there is only one:
table[0].xpath(".//th/text()")
Or, if you only want the headers and don't plan to use the table for anything else:
headers = tree.xpath('//table[@class="list"]//th/text()')
Both will give you:
['Date(s)', 'Sport', 'Event', 'Location']
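As for why the attempt in the question printed only whitespace: `xpath()` returns a list, so `next(iter(table))` hands back the `<table>` element itself rather than its first row. Iterating that element then walks its `<tr>` children, and each row's `.text` is just the newline and indentation before its first cell. A rough sketch of the difference, assuming a trimmed copy of the question's markup (the `tables` and `row_texts` names are mine, for illustration only):

```python
from lxml import html

# Trimmed copy of the question's markup, used here only for illustration.
HTML_CODE = """<table class="list">
<tr>
<th>Date(s)</th>
<th>Sport</th>
</tr>
<tr>
<td>Jan 18-31</td>
<td>Tennis</td>
</tr>
</table>"""

tree = html.fromstring(HTML_CODE)
tables = tree.xpath('//table[@class="list"]')  # xpath() returns a list of elements

# What the question's loop effectively did: iterate the <table> element,
# which yields its <tr> children, and read the text before each row's first cell.
row_texts = [row.text for row in tables[0]]

# Scoped lookup: the leading dot keeps the search inside this one table,
# whereas a bare // starts from the document root.
headers = tables[0].xpath('.//th/text()')

print(repr(row_texts))
print(headers)
```

The leading dot in `.//th` matters once a page has more than one table, since without it the expression matches every `<th>` in the document regardless of which element it is called on.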