Scrapy not getting products from an e-commerce site

Date: 2016-01-19 09:15:21

Tags: scrapy

I am trying to learn Scrapy. I have managed to scrape some sites, but I failed with others. For example, I tried to scrape http://www.polyhousestore.com/

I created a test spider that should fetch all the products on the page http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60

When I run the spider, it does not find any products. Can someone help me understand what I am doing wrong? Is it related to the CSS ::before and ::after? How can I get it to work?

Spider code (fails to get the products on the page)

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector

class PolySpider(scrapy.Spider):
    name = "poly"
    allowed_domains = ["polyhousestore.com"]
    start_urls = (
        'http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60',
    )

    def parse(self, response):
        sel = Selector(response)
        # Absolute XPath copied from the browser's inspector
        products = sel.xpath('/html/body/div[4]/div/div[5]/div/div/div/div/div[2]/div[3]/div[2]/div')
        items = []
        if not products:
            print '-------------   No products  from sel.xpath'
        else:
            print '-------------   Found products ' + str(len(products))

The command line I ran and its output:

D:\scrapyProj\cmdProj>scrapy crawl poly
2016-01-19 10:23:16 [scrapy] INFO: Scrapy 1.0.3 started (bot: cmdProj)
2016-01-19 10:23:16 [scrapy] INFO: Optional features available: ssl, http11
2016-01-19 10:23:16 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'cmdProj.spiders', 'SPIDER_MODULES': ['cmdProj.spiders'], 'BOT_NAME': 'cmdProj'}
2016-01-19 10:23:17 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-19 10:23:17 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-19 10:23:17 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-19 10:23:17 [scrapy] INFO: Enabled item pipelines:
2016-01-19 10:23:17 [scrapy] INFO: Spider opened
2016-01-19 10:23:17 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-19 10:23:17 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-19 10:23:17 [scrapy] DEBUG: Crawled (200) <GET http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60> (referer: None)
-------------   No products  from sel.xpath
2016-01-19 10:23:18 [scrapy] INFO: Closing spider (finished)
2016-01-19 10:23:18 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 254,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 16091,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 1, 19, 8, 23, 18, 53000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 1, 19, 8, 23, 17, 376000)}
2016-01-19 10:23:18 [scrapy] INFO: Spider closed (finished)

Thanks for your help.

1 Answer:

Answer 0 (score: 0)

When I look at the URL provided in the question in Chrome, I see only 2 div tags under the body of the site. This means that Scrapy also sees those 2 div tags. However, you want to access the 4th one, which does not exist, so your search returns no elements.
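A quick way to double-check what Scrapy actually receives (as opposed to what the browser's inspector renders) is the scrapy shell; its standard view(response) helper opens the downloaded HTML in your browser. A minimal session, assuming the URL from the question:

    $ scrapy shell "http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60"
    >>> view(response)   # opens the raw HTML that Scrapy downloaded in your browser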

If I open a scrapy shell and run a count over the div tags under body, I get 2:

[<Selector xpath='count(/html/body/div)' data=u'2.0'>]
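For reference, that selector output comes from a session along these lines (count() is an XPath numeric function, evaluated by the selector engine itself):

    $ scrapy shell "http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60"
    >>> response.xpath('count(/html/body/div)')
    [<Selector xpath='count(/html/body/div)' data=u'2.0'>]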

The above is the same as

    len(response.xpath('/html/body/div'))

All this means that you have to modify your query to get all the products. If you need the 4 elements from the site, try:

response.xpath('//div[@class="item-inner"]')

As you can see, you no longer need to wrap the response in a Selector with Scrapy.
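Putting it together, a minimal revised parse method might look like the sketch below. The item-inner class comes from the query above; the print-based reporting just mirrors the spider from the question (Python 2, matching the Scrapy 1.0.3 environment shown in the log):

    def parse(self, response):
        # Relative, class-based XPath instead of a brittle absolute path
        products = response.xpath('//div[@class="item-inner"]')
        if not products:
            print '-------------   No products  from response.xpath'
        else:
            print '-------------   Found products ' + str(len(products))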