I want to use Scrapy to extract the numbers from each table row.
<tr>
<td class="legend left value">1</td>
<td colspan="4" class="legend title">Corners</td>
<td class="legend right value">5</td>
</tr>
<tr>
<td class="legend left value">2</td>
<td colspan="4" class="legend title">Shots on target</td>
<td class="legend right value">8</td>
</tr>
<tr>
<td class="legend left value">3</td>
<td colspan="4" class="legend title">Shots wide</td>
<td class="legend right value">8</td>
</tr>
<tr>
<td class="legend left value">14</td>
<td colspan="4" class="legend title">Fouls</td>
<td class="legend right value">14</td>
</tr>
<tr>
<td class="legend left value">2</td>
<td colspan="4" class="legend title">Offsides</td>
<td class="legend right value">4</td>
</tr>
I have tried many different versions of the code below, but so far nothing returns any results, and there are no errors either.
P.S. This is just a sample of the data I will be working with later as part of a test.
corners = hxs.xpath("//tbody/tr/td[contains(., 'Corners')]")
stats["corners"] = corners.xpath("../td[@class = 'legend right value']/text()").extract()
Does anyone know what I'm doing wrong?
Answer 0 (score: 1)
Here's an example scrapy shell session showing the different stages. The stats table is loaded in an iframe, so you first need to grab the iframe's src attribute and then fetch that URL (there's no need to build a Request by hand; in the shell just use fetch()). It goes like this:
scrapy shell "http://int.soccerway.com/matches/2014/08/08/france/ligue-1/stade-de-reims/paris-saint-germain-fc/1686679/?ICID=PL_MS_01"
2014-08-21 11:06:19+0200 [scrapy] INFO: Scrapy 0.24.2 started (bot: scrapybot)
2014-08-21 11:06:19+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
2014-08-21 11:06:19+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 11:06:19+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-21 11:06:19+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-08-21 11:06:19+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-21 11:06:19+0200 [scrapy] INFO: Enabled item pipelines:
2014-08-21 11:06:19+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-08-21 11:06:19+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-08-21 11:06:19+0200 [default] INFO: Spider opened
2014-08-21 11:06:19+0200 [default] DEBUG: Crawled (200) <GET http://int.soccerway.com/matches/2014/08/08/france/ligue-1/stade-de-reims/paris-saint-germain-fc/1686679/?ICID=PL_MS_01> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7fcfe7bda550>
[s] item {}
[s] request <GET http://int.soccerway.com/matches/2014/08/08/france/ligue-1/stade-de-reims/paris-saint-germain-fc/1686679/?ICID=PL_MS_01>
[s] response <200 http://int.soccerway.com/matches/2014/08/08/france/ligue-1/stade-de-reims/paris-saint-germain-fc/1686679/?ICID=PL_MS_01>
[s] settings <scrapy.settings.Settings object at 0x7fcfe8299ad0>
[s] spider <Spider 'default' at 0x7fcfe7386b10>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: import urlparse
In [2]: iframe_src = response.css('div.block_match_stats_plus_chart > iframe::attr(src)').extract()[0]
In [3]: fetch(urlparse.urljoin(response.url, iframe_src))
2014-08-21 11:06:35+0200 [default] DEBUG: Crawled (200) <GET http://int.soccerway.com/charts/statsplus/1686679/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7fcfe7bda550>
[s] item {}
[s] request <GET http://int.soccerway.com/charts/statsplus/1686679/>
[s] response <200 http://int.soccerway.com/charts/statsplus/1686679/>
[s] settings <scrapy.settings.Settings object at 0x7fcfe8299ad0>
[s] spider <Spider 'default' at 0x7fcfe7386b10>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [4]: stats = {}
In [5]: for row in response.css('div.chart > table > tr:nth-child(even)'):
name = row.css('td:nth-child(even)::text').extract()[0]
stats[name] = map(int, row.css('td:nth-child(odd)::text').extract())
...:
In [6]: stats
Out[6]:
{u'Corners': [1, 5],
u'Fouls': [14, 14],
u'Offsides': [2, 4],
u'Shots on target': [2, 8],
u'Shots wide': [3, 8]}
In [7]:
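Outside the shell, the row-parsing idea above (a title cell in the middle, with the home and away value cells on either side) can be sanity-checked offline against the question's sample rows using only the standard library. This is a sketch with html.parser, not the Scrapy selectors used in the session; the StatsParser class name is mine:

```python
from html.parser import HTMLParser

class StatsParser(HTMLParser):
    """Collect {stat title: [home value, away value]} from the legend rows."""

    def __init__(self):
        super().__init__()
        self.stats = {}
        self._cells = []     # text of the <td> cells seen in the current row
        self._in_td = False  # True between a <td> start tag and its text

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td":
            self._in_td = True

    def handle_data(self, data):
        # Only capture text that sits directly inside a <td>.
        if self._in_td:
            self._cells.append(data.strip())
            self._in_td = False

    def handle_endtag(self, tag):
        # A complete legend row has exactly three cells: left, title, right.
        if tag == "tr" and len(self._cells) == 3:
            left, title, right = self._cells
            self.stats[title] = [int(left), int(right)]

html = """
<tr><td class="legend left value">1</td>
<td colspan="4" class="legend title">Corners</td>
<td class="legend right value">5</td></tr>
<tr><td class="legend left value">2</td>
<td colspan="4" class="legend title">Shots on target</td>
<td class="legend right value">8</td></tr>
"""

parser = StatsParser()
parser.feed(html)
print(parser.stats)  # {'Corners': [1, 5], 'Shots on target': [2, 8]}
```

The same dictionary shape as the shell session's stats comes out, so the extraction logic can be verified without hitting the site.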
Answer 1 (score: 0)
You can try this XPath query; I ran it successfully using this online XPath tool.
HTML
<table>
<tr>
<td class="legend left value">1</td>
<td colspan="4" class="legend title">Corners</td>
<td class="legend right value">5</td>
</tr>
<tr>
<td class="legend left value">2</td>
<td colspan="4" class="legend title">Shots on target</td>
<td class="legend right value">8</td>
</tr>
<tr>
<td class="legend left value">3</td>
<td colspan="4" class="legend title">Shots wide</td>
<td class="legend right value">8</td>
</tr>
<tr>
<td class="legend left value">1</td>
<td colspan="4" class="legend title">Corners</td>
<td class="legend right value">8</td>
</tr>
<tr>
<td class="legend left value">14</td>
<td colspan="4" class="legend title">Fouls</td>
<td class="legend right value">14</td>
</tr>
<tr>
<td class="legend left value">2</td>
<td colspan="4" class="legend title">Offsides</td>
<td class="legend right value">4</td>
</tr>
<tr>
<td class="legend left value">1</td>
<td colspan="4" class="legend title">Corners</td>
<td class="legend right value">3</td>
</tr>
</table>
XPath
//td[@class="legend title" and contains(text(), "Corner")]/following-sibling::td[1]/text()
Result
5
8
3
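The following-sibling selection above can also be reproduced without the online tool. Here is a minimal standard-library sketch: xml.etree.ElementTree works on this sample because it is well-formed, but it does not support the following-sibling axis, so that step is done in plain Python by indexing into the row's cells:

```python
import xml.etree.ElementTree as ET

# Subset of the answer's sample table (well-formed, so ElementTree can parse it).
html = """<table>
<tr><td class="legend left value">1</td><td colspan="4" class="legend title">Corners</td><td class="legend right value">5</td></tr>
<tr><td class="legend left value">3</td><td colspan="4" class="legend title">Shots wide</td><td class="legend right value">8</td></tr>
<tr><td class="legend left value">1</td><td colspan="4" class="legend title">Corners</td><td class="legend right value">3</td></tr>
</table>"""

root = ET.fromstring(html)
results = []
for row in root.iter("tr"):
    cells = list(row)
    for i, td in enumerate(cells):
        # Same condition as the XPath predicate: a title cell containing "Corner".
        if td.get("class") == "legend title" and "Corner" in (td.text or ""):
            # Equivalent of following-sibling::td[1]/text(): the next cell's text.
            results.append(cells[i + 1].text)

print(results)  # ['5', '3']
```

Note that contains() in XPath 1.0 is case-sensitive, which is why the predicate must use "Corner" (matching "Corners") rather than lowercase "corner".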