Python / lxml / XPath for parsing Yahoo Finance

Time: 2012-12-02 06:54:26

Tags: python xpath lxml yahoo-finance

Edit: I have added the exact source code I am using to try to figure out this problem.

I am trying to extract the "Total Assets" data from Yahoo Finance using Python 2.7 and lxml. An example of a page I am trying to extract this information from is http://finance.yahoo.com/q/bs?s=FAST+Balance+Sheet&annual

I have successfully extracted the "Total Assets" data from Smartmoney. An example of a Smartmoney page I am able to parse is http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView

Here is the test script I set up specifically to work on this problem:

    import urllib
    import lxml
    import lxml.html 

    url_local1 = "http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView" 
    result1 = urllib.urlopen(url_local1)
    element_html1 = result1.read()
    doc1 = lxml.html.document_fromstring (element_html1)
    list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
    print list_row1

    url_local2 = "http://finance.yahoo.com/q/bs?s=FAST" 
    result2 = urllib.urlopen(url_local2)
    element_html2 = result2.read()
    doc2 = lxml.html.document_fromstring (element_html2)
    list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
    print list_row2

I can get the row of "Total Assets" data from the Smartmoney page, but when I try to parse the Yahoo Finance page, I only get an empty list.

The source code of the table row on the Smartmoney page is:

    <tr class="odd bold">
        <th><div style='font-weight:bold'>Total Assets</div></th>
        <td>  1,684,948</td>
        <td>  1,468,283</td>
        <td>  1,327,358</td>
        <td>  1,304,149</td>
        <td>  1,163,061</td>
    </tr>

The source code of the table row on the Yahoo page is:

    <tr>
        <td colspan="2"><strong>Total Assets</strong></td>
        <td align="right"><strong>1,684,948&nbsp;&nbsp;</strong></td>
        <td align="right"><strong>1,468,283&nbsp;&nbsp;</strong></td>
        <td align="right"><strong>1,327,358&nbsp;&nbsp;</strong></td>
    </tr>

1 answer:

Answer 0: (score: 0)

Your expression contains a syntax error: it should end with td/strong/text(), and there is also a trailing ]. I would say the correct query is:

    xpath('//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')

Result:

    >>> tree.xpath('//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
    [u'1,684,948\xa0\xa0', u'1,468,283\xa0\xa0', u'1,327,358\xa0\xa0']
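The strings returned by that query still carry the trailing non-breaking spaces (`\xa0`) and the thousands separators. A minimal sketch of turning them into integers, assuming every value is a plain comma-grouped whole number; `parse_amount` is a hypothetical helper name, not part of lxml:

```python
def parse_amount(raw):
    """Convert a string such as u'1,684,948\xa0\xa0' to an int."""
    # Remove non-breaking spaces, surrounding whitespace, and the
    # comma thousands separators before converting.
    cleaned = raw.replace(u"\xa0", u"").strip().replace(u",", u"")
    return int(cleaned)


# The values below are copied from the query result shown above.
values = [u"1,684,948\xa0\xa0", u"1,468,283\xa0\xa0", u"1,327,358\xa0\xa0"]
amounts = [parse_amount(v) for v in values]
print(amounts)
```

Negative figures written in parentheses, or placeholder cells such as a dash, would need extra handling that this sketch does not attempt.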

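On the original page, the "Total Assets" <strong> element contains whitespace and newlines around the label. Apply the normalize-space() function to the result of text(), like this: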

    xpath('//td[strong[normalize-space(text())="Total Assets"]]/following-sibling::td/strong/text()')
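To see the difference between the two predicates without fetching the live page, here is a small sketch that runs both queries against an inline fragment. `SAMPLE` is a hypothetical fragment modeled on the Yahoo row quoted in the question, with the label padded by whitespace and newlines as an assumption about what the real page looks like; the actual markup may differ.

```python
import lxml.html

# Hypothetical fragment: same shape as the Yahoo row in the question,
# but with whitespace and newlines around the "Total Assets" label.
SAMPLE = u"""
<table>
  <tr>
    <td colspan="2"><strong>
        Total Assets
    </strong></td>
    <td align="right"><strong>1,684,948&nbsp;&nbsp;</strong></td>
    <td align="right"><strong>1,468,283&nbsp;&nbsp;</strong></td>
    <td align="right"><strong>1,327,358&nbsp;&nbsp;</strong></td>
  </tr>
</table>
"""

tree = lxml.html.document_fromstring(SAMPLE)

# Exact comparison: the padded text node is not equal to "Total Assets".
exact = tree.xpath(
    '//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()'
)

# normalize-space() trims and collapses whitespace before comparing.
normalized = tree.xpath(
    '//td[strong[normalize-space(text())="Total Assets"]]'
    '/following-sibling::td/strong/text()'
)

print(exact)
print(normalized)
```

With this fragment the exact-match query returns an empty list, while the `normalize-space` version returns the three value strings, which is consistent with the empty result described in the question.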