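How can I read the td values of an HTML table with the lxml library in Python? I tried selecting the table with XPath, but I can't find the right expression to return the td values. Thanks, everyone, I appreciate it.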
from glob import glob
from lxml import html

# Scan the html/ directory and scrape each HTML file
for fileName in glob('html/*.*'):
    with open(fileName) as page:
        tree = html.fromstring(page.read())
    tables = tree.xpath('//table')
    print("Tables:", tables)
Contents of page.html:
<table style="width:100%">
<tr align="right"><td>1</td><td>John</td><td>Smith</td>
<tr align="right"><td>2</td><td>Tody</td><td>Miller</td>
</table>
Answer 0 (score: 1)
If you want to find the tds inside trs that are right-aligned, you need to filter on the align attribute:
tds = tree.xpath("//table/tr[@align='right']/td")
If you only want the text of each td:
tree.xpath("//table/tr[@align='right']/td/text()")
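One caveat worth sketching: text() only returns text nodes that sit directly inside each td, so a cell whose text is wrapped in another tag is skipped. The sample below is hypothetical (the <b> tag is not in the question's table) and shows text_content() as one possible alternative:

```python
from lxml import html

# Hypothetical sample: the second cell wraps its text in <b>.
sample = """<table>
<tr align="right"><td>1</td><td><b>John</b></td><td>Smith</td></tr>
</table>"""

tree = html.fromstring(sample)
tds = tree.xpath("//table/tr[@align='right']/td")

# text() only sees text directly inside each td, so the <b> cell is missed.
direct_text = tree.xpath("//table/tr[@align='right']/td/text()")

# text_content() collects all text inside each td, including nested tags.
all_text = [td.text_content().strip() for td in tds]

print(direct_text)
print(all_text)
```

For the plain table in the question both approaches should return the same values.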
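In practice, though, you probably want to keep the cells of each row together, so you should find the trs first and then group the td text by row: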
x = """<table style="width:100%">
<tr align="right"><td>1</td><td>John</td><td>Smith</td>
<tr align="right"><td>2</td><td>Tody</td><td>Miller</td>
</table> """
from lxml import html
tree = html.fromstring(x)
# first get the trs, filtering by attribute
trs = tree.xpath("//table/tr[@align='right']")
# then extract the tds from each tr
data = [row.xpath("td/text()") for row in trs]
Which gives you:
[['1', 'John', 'Smith'], ['2', 'Tody', 'Miller']]
If you only want the names, you can skip the first td:
trs = tree.xpath("//table/tr[@align='right']")
# position() > 1, all but the first td, xpath has one based indexing.
names = [row.xpath("td[position() > 1]/text()") for row in trs]
Or join each row's names into a single string:
full_names = [" ".join(row.xpath("td[position() > 1]/text()")) for row in trs]
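If you would rather address each field by name than by position, one option is to zip every row with a tuple of column names. This is only a sketch: the names id, first_name and last_name are assumptions, since the table in the question has no header row.

```python
from lxml import html

x = """<table style="width:100%">
<tr align="right"><td>1</td><td>John</td><td>Smith</td>
<tr align="right"><td>2</td><td>Tody</td><td>Miller</td>
</table> """

tree = html.fromstring(x)
trs = tree.xpath("//table/tr[@align='right']")

# Assumed column names; the source table does not define any.
columns = ("id", "first_name", "last_name")

# One dict per row, pairing each column name with the td text at that position.
records = [dict(zip(columns, row.xpath("td/text()"))) for row in trs]
print(records)
```

Note that zip stops at the shorter sequence, so a row with fewer cells than columns simply yields a smaller dict rather than raising an error.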
Answer 1 (score: 0)
Code:
>>> page="""<table style="width:100%">
<tr>
<th>Id</th>
<th>Name</th>
<th>Age</th>
</tr>
<tr>
<td>1</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>2</td>
<td>Jackson</td>
<td>94</td>
</tr>
<tr>
<td>3</td>
<td>Miller</td>
<td>43</td>
</tr>
</table> """
>>> from lxml import html
>>> tree=html.fromstring(page)
>>> tree.xpath('//tr/td//text()')
Output:
['1', 'Smith', '50', '2', 'Jackson', '94', '3', 'Miller', '43']
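The flat list above loses track of which value belongs to which row. A possible way to keep the rows, sketched here on a compacted copy of the same table, is to read the th cells as keys and build one dict per data row. The variable names headers and rows are just illustrative.

```python
from lxml import html

page = """<table style="width:100%">
<tr><th>Id</th><th>Name</th><th>Age</th></tr>
<tr><td>1</td><td>Smith</td><td>50</td></tr>
<tr><td>2</td><td>Jackson</td><td>94</td></tr>
<tr><td>3</td><td>Miller</td><td>43</td></tr>
</table>"""

tree = html.fromstring(page)

# Header cells become the dict keys.
headers = tree.xpath("//tr/th/text()")

# //tr[td] keeps only rows that contain td cells, which skips the header row.
rows = [dict(zip(headers, tr.xpath("td/text()"))) for tr in tree.xpath("//tr[td]")]
print(rows)
```

All values come back as strings, so convert Id and Age with int() if you need numbers.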