Python: extracting content with lxml XPath

Time: 2016-09-07 14:12:36

Tags: python-2.7 lxml lxml.html

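The code below extracts the P/E from the first Reuters link. However, my approach is not robust: the page for another stock has two fewer rows, so the data shifts and the index no longer points at the P/E. How can I deal with this? I would like to target the P/E section directly, but I don't know how. Link 1: http://www.reuters.com/finance/stocks/financialHighlights?symbol=MYEG.KL Link 2: http://www.reuters.com/finance/stocks/financialHighlights?symbol=ANNJ.KL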

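import requests
from lxml import html

# Fetch the page and parse it into an element tree
page2 = requests.get('http://www.reuters.com/finance/stocks/financialHighlights?symbol=MYEG.KL')
treea = html.fromstring(page2.content)
# Collect the text of every <td> that has a class attribute
tree4 = treea.xpath('//td[@class]/text()')
# Fragile: assumes the P/E value is always at index 37
PE = tree4[37]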

This is the part I want the code to extract, and only this part, so that changes elsewhere on the page do not affect the result.

 <tr class="stripe">
                <td>P/E Ratio (TTM)</td>
                <td class="data">36.79</td>
                <td class="data">25.99</td>
                <td class="data">21.70</td>
            </tr>

1 answer:

Answer 0 (score: 1)

Find the first td by its text, then extract its sibling tds:

 treea.xpath('//td[contains(.,"P/E Ratio")]/following-sibling::td/text()')

This works no matter where the row sits on the page:

In [8]: page2 = requests.get('http://www.reuters.com/finance/stocks/financialHighlights?symbol=MYEG.KL')

In [9]: treea = html.fromstring(page2.content)    
In [10]: tree4 = treea.xpath('//td[contains(.,"P/E Ratio")]/following-sibling::td/text()')

In [11]: print(tree4)
['36.79', '25.99', '21.41']

In [12]: page2 = requests.get('http://www.reuters.com/finance/stocks/financialHighlights?symbol=ANNJ.KL')
In [13]: treea = html.fromstring(page2.content)

In [14]: tree4 = treea.xpath('//td[contains(.,"P/E Ratio")]/following-sibling::td/text()')

In [15]: print(tree4)
['--', '25.49', '17.30']
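
The same lookup can be tried offline. The following is only a minimal sketch: it runs the answer's XPath against the HTML fragment quoted in the question rather than the live Reuters page, and the names SAMPLE, extract_row and to_float are hypothetical helpers introduced for illustration. The extra first row in SAMPLE is made up to show that the position of the P/E row does not matter.

```python
from lxml import html

# Stand-in for the downloaded page: the <tr> from the question, preceded by
# a made-up row so the P/E row is not the first one.
SAMPLE = """
<table>
  <tr><td>Some Other Row</td><td class="data">1.00</td></tr>
  <tr class="stripe">
    <td>P/E Ratio (TTM)</td>
    <td class="data">36.79</td>
    <td class="data">25.99</td>
    <td class="data">21.70</td>
  </tr>
</table>
"""


def extract_row(page_source, label):
    """Return the text of the cells that follow the cell containing `label`."""
    tree = html.fromstring(page_source)
    # $label is an XPath variable; lxml fills it from the keyword argument.
    cells = tree.xpath(
        '//td[contains(., $label)]/following-sibling::td/text()',
        label=label,
    )
    return [cell.strip() for cell in cells]


def to_float(cell):
    """Convert a cell to float, or None for placeholders such as '--'."""
    try:
        return float(cell)
    except ValueError:
        return None


values = extract_row(SAMPLE, "P/E Ratio")
print(values)                         # ['36.79', '25.99', '21.70']
print([to_float(v) for v in values])  # [36.79, 25.99, 21.7]
```

Because a missing figure shows up as '--' (as in the ANNJ.KL output above), converting through something like to_float avoids a ValueError when the numbers are used later. If no cell contains the label, the XPath returns an empty list, so check its length before indexing.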