Python:需要在表中抓取将使用lxml扩展的数据

时间:2016-11-14 15:30:48

标签: python xpath lxml

我想得到季度结果Profit Attb。来自http://klse.i3investor.com/servlets/stk/fin/8982.jsp

的SH(' 000)
from kivy.app import App

from kivy.uix.label import Label

from kivy.uix.boxlayout import BoxLayout

class MyWindow(App):

    def build(self):
        box = BoxLayout()
        label = Label(text='Hello World')
        box.add_widget(label)
        return box

window = MyWindow()

window.run()

例如,有年度结果表和季度结果表,我只需要过去两个季度的Profit Attb数据。到SH(' 000),这是10,398和5,068 enter image description here

然而,该表每个季度都在扩大。我希望有一种强大的方法来使用lxml,xpath或cssselect来检索数据。因此,当下一季度的数据到来时,我的代码仍在运行。

<b>Quarter Result:</b><br/> <table cellpadding="0" cellspacing="0" border="0" class="nc" width="100%"> <tr> <th class="left">F.Y.</th> <th class="left">Quarter</th> <th class="right">Revenue ('000)</th> <th class="right">Profit before Tax ('000)</th> <th class="right">Profit ('000)</th> <th class="right">Profit Attb. to SH ('000)</th> <th class="right">EPS (Cent)</th> <th class="right">DPS (Cent)</th> <th class="right">NAPS</th> <th class="center" width="33"></th> </tr>  <tr> <td class="left" valign="top" nowrap="nowrap"> 2016-12-31 </td> <td class="left" valign="top" nowrap="nowrap"> 2016-09-30 </td> <td class="right" valign="top" nowrap="nowrap"> 79,082 </td> <td class="right" valign="top" nowrap="nowrap"> 14,376 </td> <td class="right" valign="top" nowrap="nowrap"> 10,692 </td> <td class="right" valign="top" nowrap="nowrap"> 10,398 </td> <td class="right" valign="top" nowrap="nowrap"> 3.37 </td> <td class="right" valign="top" nowrap="nowrap"> 0.00 </td> <td class="right" valign="top" nowrap="nowrap"> 1.5100 </td> <td class="center" valign="top" nowrap="nowrap"> <a href="" onclick="viewFinancialSource('62459');return false;" title="View Source"> <img class="sp view16" src="http://cdn1.i3investor.com/cm/icon/trans16.gif" width="16px;" alt="View Source"/> </a> 
 <span class="hide" id="financialSourceTitle62459"> Quarter: 2016-09-30 </span> <span class="hide" id="financialSourceDetail62459">  <p> <a target="_blank" href ="/servlets/staticfile/290836.jsp"> <img src="http://cdn1.i3investor.com/cm/icon/file-download-small.png" width="16px" height="16px" alt="3rd Q2016_CGB.PDF"/> 3rd Q2016_CGB.PDF </a> </p>  </span>  </td> </tr>  <tr> <td class="left" valign="top" nowrap="nowrap"> 2016-12-31 </td> <td class="left" valign="top" nowrap="nowrap"> 2016-06-30 </td> <td class="right" valign="top" nowrap="nowrap"> 51,277 </td> <td class="right" valign="top" nowrap="nowrap"> 7,050 </td> <td class="right" valign="top" nowrap="nowrap"> 5,364 </td> <td class="right" valign="top" nowrap="nowrap"> 5,068 </td> <td class="right" valign="top" nowrap="nowrap"> 1.64 </td> <td class="right" valign="top" nowrap="nowrap"> 0.00 </td> <td class="right" valign="top" nowrap="nowrap"> 1.4800 </td> <td class="center" valign="top" nowrap="nowrap"> <a href="" onclick="viewFinancialSource('56288');return false;" title="View Source"> <img class="sp view16" src="http://cdn1.i3investor.com/cm/icon/trans16.gif" width="16px;" alt="View Source"/> </a> 

但返回空白[&#39;&#39;&#39;&#39;&#39;&#39;&#39;&#39;&#39;]

1 个答案:

答案 0 :(得分:0)

尝试使用索引位置()

import requests
from lxml import html

response = requests.get("http://klse.i3investor.com/servlets/stk/fin/8982.jsp")
tree = html.fromstring(response.content)

profit_attb_to_sh = [x.strip() for x in tree.xpath('//table[2]/tr[position()>1 and position()<=3]/td[@class="right"][4]/text()')]

print '\n'.join(profit_attb_to_sh)
  

表格[2] 表示:   你将从DOM获得第二个表。

     

tr [position()&gt; 1和position()&lt; = 3] 表示:您将获得该索引范围内的每个 tr (2和3) )。排除第一个,因为它包含标题。

     

td [@ class =“right”] [4] 表示:您将从 tr td >符合上述规则。

     

请记住,XPath索引从1开始(而不是从0开始)

<强>结果:

10,398
5,068

<强> PS:

你可以改变:

profit_attb_to_sh = [x.strip() for x in tree.xpath('//table[2]/tr[position()>1 and position()<=3]/td[@class="right"][4]/text()')]

有:

profit_attb_to_sh = map(lambda x: x.strip(), tree.xpath('//table[2]/tr[position()>1 and position()<=3]/td[@class="right"][4]/text()'))