Python - Reading a table

Time: 2016-10-20 23:27:43

Tags: python html xpath web-scraping lxml

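Using the lxml library in Python, how do I read the td values of an HTML table? I tried querying the table with XPath, but I can't find the right expression to return the td values. Thanks, everyone, I appreciate it.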

from glob import glob
from lxml import html

# Scan the html/ directory and parse each HTML file found there
dirScan = glob('html/*.*')
for fileName in dirScan:
    with open(fileName) as page:
        tree = html.fromstring(page.read())
    # This returns the table elements, not the td values
    tables = tree.xpath('//table')
    print("Tables:", tables)

In page.html:

 <table style="width:100%">
 <tr align="right"><td>1</td><td>John</td><td>Smith</td>
 <tr align="right"><td>2</td><td>Tody</td><td>Miller</td>
</table> 

2 answers:

Answer 0 (score: 1)

If you want the tds inside the right-aligned trs, you need to filter on the align attribute:

tds = tree.xpath("//table/tr[@align='right']/td")
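
That expression returns td element objects rather than strings. As a minimal sketch of reading values from those elements, assuming the same sample table as in the question (the `sample` and `values` names are only for illustration):

```python
from lxml import html

# Sample markup copied from the question; the parser repairs the unclosed <tr> tags.
sample = """<table style="width:100%">
 <tr align="right"><td>1</td><td>John</td><td>Smith</td>
 <tr align="right"><td>2</td><td>Tody</td><td>Miller</td>
</table>"""

tree = html.fromstring(sample)
tds = tree.xpath("//table/tr[@align='right']/td")

# Each item is an element; text_content() returns its text, including text in nested tags.
values = [td.text_content().strip() for td in tds]
print(values)
```

Using `text_content()` instead of `.text` may help if a cell wraps its value in another tag such as `<b>`.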

If you only want the text of each td:

texts = tree.xpath("//table/tr[@align='right']/td/text()")

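In practice, though, you probably want to keep each row's values together, so you should find the trs first and then group the td text by row: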

x = """<table style="width:100%">
 <tr align="right"><td>1</td><td>John</td><td>Smith</td>
 <tr align="right"><td>2</td><td>Tody</td><td>Miller</td>
</table> """

from lxml import html

tree = html.fromstring(x)

# first get the trs, filtering by attribute 
trs = tree.xpath("//table/tr[@align='right']")

# then extract the tds from each tr 
data = [row.xpath("td/text()") for row in trs]

Which gives you:

[['1', 'John', 'Smith'], ['2', 'Tody', 'Miller']]
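
To connect this back to the question's directory scan, the row grouping could be wrapped in a small helper and applied to each file. This is only a sketch: `extract_rows` is a hypothetical name, and the `html/` directory layout and UTF-8 encoding are assumptions taken from or added to the question's setup.

```python
from glob import glob
from lxml import html


def extract_rows(markup):
    """Return one list of cell strings per right-aligned <tr> (hypothetical helper)."""
    tree = html.fromstring(markup)
    trs = tree.xpath("//table/tr[@align='right']")
    return [row.xpath("td/text()") for row in trs]


# Assumes the question's layout: HTML files in an html/ directory next to the script.
for file_name in glob("html/*.*"):
    # UTF-8 is an assumption; adjust if the files use another encoding.
    with open(file_name, encoding="utf-8") as page:
        print(file_name, extract_rows(page.read()))
```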

If you only want each name, you can skip the first td:

trs = tree.xpath("//table/tr[@align='right']")

# position() > 1, all but the first td, xpath has one based indexing.
names = [row.xpath("td[position() > 1]/text()") for row in trs]

Or join them into a single string:

full_names = [" ".join(row.xpath("td[position() > 1]/text()")) for row in trs]
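
Since XPath positions are one-based, `td[1]` selects the first cell of a row. As a sketch that builds on that, the id in the first cell could be used as a key for the joined name (the `people` name is hypothetical, and the sample markup is the one from the question):

```python
from lxml import html

sample = """<table style="width:100%">
 <tr align="right"><td>1</td><td>John</td><td>Smith</td>
 <tr align="right"><td>2</td><td>Tody</td><td>Miller</td>
</table>"""

tree = html.fromstring(sample)
trs = tree.xpath("//table/tr[@align='right']")

# td[1] is the first cell (the id); position() > 1 selects the remaining cells.
people = {
    row.xpath("td[1]/text()")[0]: " ".join(row.xpath("td[position() > 1]/text()"))
    for row in trs
}
print(people)
```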

Answer 1 (score: 0)

Code:

>>> from lxml import html
>>> page="""<table style="width:100%">
      <tr>
        <th>Id</th>
        <th>Name</th>
        <th>Age</th>
      </tr>
      <tr>
        <td>1</td>
        <td>Smith</td>
        <td>50</td>
      </tr>
      <tr>
        <td>2</td>
        <td>Jackson</td>
        <td>94</td>
      </tr>
      <tr>
        <td>3</td>
        <td>Miller</td>
        <td>43</td>
      </tr>
    </table> """
>>> tree=html.fromstring(page)
>>> tree.xpath('//tr/td//text()')

Output:

['1', 'Smith', '50', '2', 'Jackson', '94', '3', 'Miller', '43']
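
The flat list above loses the row boundaries. One possible way to keep them, sketched here under the assumption that the first row holds th header cells as in this sample, is to pair each row's cells with the header names (`headers` and `records` are illustrative names):

```python
from lxml import html

page = """<table style="width:100%">
  <tr><th>Id</th><th>Name</th><th>Age</th></tr>
  <tr><td>1</td><td>Smith</td><td>50</td></tr>
  <tr><td>2</td><td>Jackson</td><td>94</td></tr>
  <tr><td>3</td><td>Miller</td><td>43</td></tr>
</table>"""

tree = html.fromstring(page)
headers = tree.xpath("//tr/th/text()")

# //tr[td] matches only rows that contain td cells, so the header row is skipped.
records = [dict(zip(headers, row.xpath("td/text()"))) for row in tree.xpath("//tr[td]")]
print(records)
```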