How do I get text from an XPath above the selected XPath?

Asked: 2014-08-20 17:37:19

Tags: python xpath web-scraping scrapy

I want to use Scrapy to extract the numbers from each table row.

     <tr>  
        <td class="legend left value">1</td>
        <td colspan="4" class="legend title">Corners</td>
        <td class="legend right value">5</td>
      </tr>
      <tr>  
        <td class="legend left value">2</td>
        <td colspan="4" class="legend title">Shots on target</td>
        <td class="legend right value">8</td>
      </tr>
      <tr>  
        <td class="legend left value">3</td>
        <td colspan="4" class="legend title">Shots wide</td>
        <td class="legend right value">8</td>
      </tr>
      <tr>  
        <td class="legend left value">14</td>
        <td colspan="4" class="legend title">Fouls</td>
        <td class="legend right value">14</td>
      </tr>
      <tr>  
        <td class="legend left value">2</td>
        <td colspan="4" class="legend title">Offsides</td>
        <td class="legend right value">4</td>
      </tr>

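I have tried many different versions of the code below, but so far nothing returns any data, and no errors are raised either.

P.S. This is just a sample that I will later use as part of a test.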

corners = hxs.xpath("//tbody/tr/td[contains(., 'Corners')]")
stats ["corners"] = corners.xpath("../td[@class = 'legend right value']/text()").extract()

Does anyone know what I am doing wrong?

2 answers:

Answer 0: (score: 1)

Here is a sample scrapy shell session that goes through the different stages:

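  1. Fetch the start page.
  2. Grab the iframe that contains the statistics you are after and get its src attribute.
  3. Fetch the corresponding iframe content (this requires another Request; in the shell, just use fetch()).
  4. Find the table that contains the data and select only the rows at even positions.
  5. For each row, use the odd-position cells (1 and 3) as the numbers and the second cell as the statistic name.

It goes like this: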

    scrapy shell "http://int.soccerway.com/matches/2014/08/08/france/ligue-1/stade-de-reims/paris-saint-germain-fc/1686679/?ICID=PL_MS_01"
    2014-08-21 11:06:19+0200 [scrapy] INFO: Scrapy 0.24.2 started (bot: scrapybot)
    2014-08-21 11:06:19+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
    2014-08-21 11:06:19+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
    2014-08-21 11:06:19+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-08-21 11:06:19+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-08-21 11:06:19+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-08-21 11:06:19+0200 [scrapy] INFO: Enabled item pipelines: 
    2014-08-21 11:06:19+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2014-08-21 11:06:19+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
    2014-08-21 11:06:19+0200 [default] INFO: Spider opened
    2014-08-21 11:06:19+0200 [default] DEBUG: Crawled (200) <GET http://int.soccerway.com/matches/2014/08/08/france/ligue-1/stade-de-reims/paris-saint-germain-fc/1686679/?ICID=PL_MS_01> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7fcfe7bda550>
    [s]   item       {}
    [s]   request    <GET http://int.soccerway.com/matches/2014/08/08/france/ligue-1/stade-de-reims/paris-saint-germain-fc/1686679/?ICID=PL_MS_01>
    [s]   response   <200 http://int.soccerway.com/matches/2014/08/08/france/ligue-1/stade-de-reims/paris-saint-germain-fc/1686679/?ICID=PL_MS_01>
    [s]   settings   <scrapy.settings.Settings object at 0x7fcfe8299ad0>
    [s]   spider     <Spider 'default' at 0x7fcfe7386b10>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser
    
    In [1]: import urlparse
    
    In [2]: iframe_src = response.css('div.block_match_stats_plus_chart > iframe::attr(src)').extract()[0]
    
    In [3]: fetch(urlparse.urljoin(response.url, iframe_src))
    2014-08-21 11:06:35+0200 [default] DEBUG: Crawled (200) <GET http://int.soccerway.com/charts/statsplus/1686679/> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7fcfe7bda550>
    [s]   item       {}
    [s]   request    <GET http://int.soccerway.com/charts/statsplus/1686679/>
    [s]   response   <200 http://int.soccerway.com/charts/statsplus/1686679/>
    [s]   settings   <scrapy.settings.Settings object at 0x7fcfe8299ad0>
    [s]   spider     <Spider 'default' at 0x7fcfe7386b10>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser
    
    In [4]: stats = {}
    
    In [5]: for row in response.css('div.chart > table > tr:nth-child(even)'):
       ...:     name = row.css('td:nth-child(even)::text').extract()[0]
       ...:     stats[name] = map(int, row.css('td:nth-child(odd)::text').extract())
       ...:     
    
    In [6]: stats
    Out[6]: 
    {u'Corners': [1, 5],
     u'Fouls': [14, 14],
     u'Offsides': [2, 4],
     u'Shots on target': [2, 8],
     u'Shots wide': [3, 8]}
    
    In [7]: 
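
The session above runs on Python 2 and needs Scrapy plus network access. As a rough offline sketch of the same two ideas (resolving the iframe URL, then reading each row as left value, name, right value), the snippet below uses only the Python 3 standard library in place of Scrapy selectors. The `IFRAME_SRC` value and the helper name `parse_stats` are assumptions made for illustration; the real src attribute is not shown in the session, and the sample table is a trimmed copy of the markup from the question.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin  # Python 3 location of urlparse.urljoin

PAGE_URL = (
    "http://int.soccerway.com/matches/2014/08/08/france/ligue-1/"
    "stade-de-reims/paris-saint-germain-fc/1686679/"
)
# Assumption: the iframe src is a site-relative path. The actual attribute
# value does not appear in the session above.
IFRAME_SRC = "/charts/statsplus/1686679/"

# Trimmed copy of the table from the question (well-formed, so an XML
# parser can stand in for Scrapy's selectors here).
SAMPLE_TABLE = """
<table>
  <tr>
    <td class="legend left value">1</td>
    <td colspan="4" class="legend title">Corners</td>
    <td class="legend right value">5</td>
  </tr>
  <tr>
    <td class="legend left value">14</td>
    <td colspan="4" class="legend title">Fouls</td>
    <td class="legend right value">14</td>
  </tr>
</table>
"""


def parse_stats(table_markup):
    """Map each statistic name to its [left, right] integer values."""
    stats = {}
    for row in ET.fromstring(table_markup.strip()).iter("tr"):
        # Cells 1 and 3 hold the numbers, cell 2 holds the name.
        left, title, right = row.findall("td")
        stats[title.text] = [int(left.text), int(right.text)]
    return stats


iframe_url = urljoin(PAGE_URL, IFRAME_SRC)
stats = parse_stats(SAMPLE_TABLE)
print(iframe_url)
print(stats)
```

Inside a real spider the second fetch would be a new Request whose callback does the row parsing; the sketch only shows the data handling, not the crawling.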
    

Answer 1: (score: 0)

You can try this XPath query; I ran it successfully with this online XPath tool.

HTML:

<table> 
      <tr>  
        <td class="legend left value">1</td>
        <td colspan="4" class="legend title">Corners</td>
        <td class="legend right value">5</td>
      </tr>
      <tr>  
        <td class="legend left value">2</td>
        <td colspan="4" class="legend title">Shots on target</td>
        <td class="legend right value">8</td>
      </tr>
      <tr>  
        <td class="legend left value">3</td>
        <td colspan="4" class="legend title">Shots wide</td>
        <td class="legend right value">8</td>
      </tr>
      <tr>  
        <td class="legend left value">1</td>
        <td colspan="4" class="legend title">Corners</td>
        <td class="legend right value">8</td>
      </tr>
      <tr>  
        <td class="legend left value">14</td>
        <td colspan="4" class="legend title">Fouls</td>
        <td class="legend right value">14</td>
      </tr>
      <tr>  
        <td class="legend left value">2</td>
        <td colspan="4" class="legend title">Offsides</td>
        <td class="legend right value">4</td>
      </tr>
      <tr>  
        <td class="legend left value">1</td>
        <td colspan="4" class="legend title">Corners</td>
        <td class="legend right value">3</td>
      </tr>
</table>

XPath:

//td[@class="legend title" and contains(text(), "Corner")]/following-sibling::td[1]/text()

Result:

5
8
3
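
The lookup can be cross-checked offline with Python's standard library. This is only a sketch under some assumptions: `xml.etree.ElementTree` supports a small subset of XPath (no `contains()` and no `following-sibling` axis), so instead of the query above it matches the title cell exactly and steps up to the row with `..`, much like the parent-based query in the question. The table is a trimmed copy of the HTML above, and Python 3.7 or later is assumed for the `[.='text']` predicate.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the table above: three "Corners" rows and one other row.
TABLE = """
<table>
  <tr>
    <td class="legend left value">1</td>
    <td colspan="4" class="legend title">Corners</td>
    <td class="legend right value">5</td>
  </tr>
  <tr>
    <td class="legend left value">2</td>
    <td colspan="4" class="legend title">Shots on target</td>
    <td class="legend right value">8</td>
  </tr>
  <tr>
    <td class="legend left value">1</td>
    <td colspan="4" class="legend title">Corners</td>
    <td class="legend right value">8</td>
  </tr>
  <tr>
    <td class="legend left value">1</td>
    <td colspan="4" class="legend title">Corners</td>
    <td class="legend right value">3</td>
  </tr>
</table>
"""

table = ET.fromstring(TABLE.strip())

# Select each cell whose text is exactly "Corners", step up to its row
# with "..", then down to the right-hand value cell of that row.
path = ".//td[.='Corners']/../td[@class='legend right value']"
right_values = [cell.text for cell in table.findall(path)]
print(right_values)
```

Note that the sample markup has no tbody element, so a path that starts at //tbody (as in the question) has nothing to match in this table.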