Python: robust XPath for a table built from tr and td tags, excluding unwanted data

Date: 2018-02-07 13:37:21

Tags: python html xpath lxml

I need a robust XPath for the page at "http://www.screener.com/v2/stocks/view/5131".

However, there is a lot of whitespace around the data I want, and my current XPath is not robust.

The values I need are 11.48, 9.05 and 11.53, from the HTML below:

 <div class="table-responsive">
                        <table class="table table-hover">
                            <tr>
                                <th>Financial Year</th>
                                <th class="number">Revenue ('000)</th>
                                <th class="number">Net ('000)</th>
                                <th class="number">EPS</th>
                                <th></th>
                            </tr>

                                                                    <tr>
                                    <td>30 Nov, 2017</td>
                                    <td class="number">205,686</td>
                                    <td class="number">52,812</td>
                                    <td class="number">11.48</td>
                                    <td></td>
                                </tr>

                                                                    <tr>
                                    <td>30 Nov, 2016</td>
                                    <td class="number">191,301</td>
                                    <td class="number">41,598</td>
                                    <td class="number">9.05</td>
                                    <td></td>
                                </tr>

                                                                    <tr>
                                    <td>30 Nov, 2015</td>
                                    <td class="number">225,910</td>
                                    <td class="number">51,082</td>
                                    <td class="number">11.53</td>
                                    <td></td>
                                </tr>

My code is as follows:

from lxml import html
import requests
page = requests.get('http://www.screener.com/v2/stocks/view/5131')
output = html.fromstring(page.content)
output.xpath('//tr/td/following-sibling::td/text()')

How can I change the code so that it robustly extracts the three numbers shown above from the table?

I only want 11.48, 9.05 and 11.53 as output, but I can't get rid of all the other data in the table.

1 answer:

Answer 0 (score: 0)

Try the XPath below to get the desired output:

//div[@id="annual"]//tr/td[position() = last() - 1]/text()
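For context, the expression in the question, //tr/td/following-sibling::td/text(), matches every td that has an earlier td sibling, so it returns every cell except the first one in each row. The expression above instead picks the second-to-last td of each row, which is the EPS column. Below is a minimal sketch of how it could be used with lxml. It runs against an inline copy of the question's markup rather than the live page, and the wrapping div with id="annual" is an assumption taken from the XPath itself, since the question's snippet does not show it. The names SAMPLE and extract_eps are hypothetical, introduced only for illustration.

```python
from lxml import html

# Markup copied from the question. The outer <div id="annual"> is an ASSUMPTION:
# the XPath relies on it, but the question's snippet does not include it.
SAMPLE = """
<div id="annual">
  <div class="table-responsive">
    <table class="table table-hover">
      <tr>
        <th>Financial Year</th>
        <th class="number">Revenue ('000)</th>
        <th class="number">Net ('000)</th>
        <th class="number">EPS</th>
        <th></th>
      </tr>
      <tr>
        <td>30 Nov, 2017</td>
        <td class="number">205,686</td>
        <td class="number">52,812</td>
        <td class="number">11.48</td>
        <td></td>
      </tr>
      <tr>
        <td>30 Nov, 2016</td>
        <td class="number">191,301</td>
        <td class="number">41,598</td>
        <td class="number">9.05</td>
        <td></td>
      </tr>
      <tr>
        <td>30 Nov, 2015</td>
        <td class="number">225,910</td>
        <td class="number">51,082</td>
        <td class="number">11.53</td>
        <td></td>
      </tr>
    </table>
  </div>
</div>
"""


def extract_eps(markup):
    """Return the text of the second-to-last <td> of every row (hypothetical helper)."""
    tree = html.fromstring(markup)
    # Select the cell elements rather than text() nodes, so that surrounding
    # whitespace can be stripped in Python afterwards.
    cells = tree.xpath('//div[@id="annual"]//tr/td[position() = last() - 1]')
    return [cell.text_content().strip() for cell in cells]


eps_values = extract_eps(SAMPLE)
print(eps_values)
```

The header row is skipped automatically because it contains th elements rather than td. To run this against the real page, the markup argument could be page.content from the requests call in the question, assuming the live page really wraps this table in an element with id="annual".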