Q: scrapy-redis doesn't scrape any pages, just finishes within a second

Date: 2015-07-21 03:52:07

Tags: web-scraping scrapy scrapy-spider

My spider doesn't scrape any pages; it finishes in less than a second without throwing any errors.

I've checked the code and compared it with a similar project that ran successfully a few weeks ago, but I still can't figure out what the problem might be.

I'm using Scrapy 1.0.1 and scrapy-redis 0.6.

Here is the log:

2015-07-21 11:33:20 [scrapy] INFO: Scrapy 1.0.1 started (bot: demo)
2015-07-21 11:33:20 [scrapy] INFO: Optional features available: ssl, http11
2015-07-21 11:33:20 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'demo.spiders', 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['demo.spiders'], 'RETRY_HTTP_CODES': [500, 502, 503, 504, 400, 408, 404, 302, 403], 'BOT_NAME': 'demo', 'SCHEDULER': 'scrapy_redis.scheduler.Scheduler', 'DEFAULT_ITEM_CLASS': 'demo.items.DemoItem', 'REDIRECT_ENABLED': False}
2015-07-21 11:33:20 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-21 11:33:20 [scrapy] INFO: Enabled downloader middlewares: CustomUserAgentMiddleware, CustomHttpProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-21 11:33:20 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-21 11:33:20 [scrapy] INFO: Enabled item pipelines: RedisPipeline, DemoPipeline
2015-07-21 11:33:20 [scrapy] INFO: Spider opened
2015-07-21 11:33:20 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-21 11:33:20 [scrapy] INFO: Closing spider (finished)
2015-07-21 11:33:20 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 21, 3, 33, 20, 301371),
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 7, 21, 3, 33, 20, 296941)}
2015-07-21 11:33:20 [scrapy] INFO: Spider closed (finished)
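
For reference, the "Overridden settings" line in the log implies a settings.py roughly like the sketch below. The ITEM_PIPELINES priorities and the Redis connection values are assumptions; they don't appear in the log:

# settings.py -- a sketch reconstructed from the "Overridden settings" log line
BOT_NAME = 'demo'
SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'
LOG_LEVEL = 'INFO'
DEFAULT_ITEM_CLASS = 'demo.items.DemoItem'
REDIRECT_ENABLED = False
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 408, 404, 302, 403]

# scrapy-redis scheduler: requests are queued in Redis instead of in memory
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'

# The log shows RedisPipeline and DemoPipeline enabled; the priorities here
# are assumptions, as are the Redis connection parameters below.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
    'demo.pipelines.DemoPipeline': 300,
}
REDIS_HOST = 'localhost'  # assumption: local Redis on the default port
REDIS_PORT = 6379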

Here is spider.py:

# -*- coding: utf-8 -*-
import scrapy
from demo.items import DemoItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisMixin
# NB: unused import, apparently auto-imported from pip's vendored requests
from pip._vendor.requests.models import Request


class DemoCrawler(RedisMixin, CrawlSpider):
    name = "demo"
    redis_key = "demoCrawler:start_urls"
    rules = (
        # follow shop detail pages and parse them
        Rule(LinkExtractor(allow='/shop/\d+?/$',
                           restrict_xpaths=u"//ul/li/div[@class='txt']/div[@class='tit']/a"),
             callback='parse_demo'),
        # follow the "next page" link
        Rule(LinkExtractor(restrict_xpaths=u"//div[@class='shop-wrap']/div[@class='page']/a[@class='next']"),
             follow=True),
    )

    def parse_demo(self, response):
        item = DemoItem()

        temp = response.xpath(u"//div[@id='basic-info']/div[@class='action']/a/@href").re("\d.+\d")
        item['id'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='page-header']/div[@class='container']/a[@class='city J-city']/text()").extract()
        item['city'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='breadcrumb']/span/text()").extract()
        item['name'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='main']/div[@id='sales']/text()").extract()
        item['deals'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='main-nav']/div[@class='container']/a[1]/text()").extract()
        item['category'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='main']/div[@id='basic-info']/div[@class='expand-info address']/a/span/text()").extract()
        item['region'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='main']/div[@id='basic-info']/div[@class='expand-info address']/span/text()").extract()
        item['address'] = temp[0] if temp else ''

        yield item
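
The DemoItem class itself is not shown in the question; inferred from the fields assigned in parse_demo, it would be a sketch like this:

# items.py -- a sketch inferred from the fields used in parse_demo;
# the actual demo.items.DemoItem may differ
import scrapy

class DemoItem(scrapy.Item):
    id = scrapy.Field()
    city = scrapy.Field()
    name = scrapy.Field()
    deals = scrapy.Field()
    category = scrapy.Field()
    region = scrapy.Field()
    address = scrapy.Field()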

To start the spider, I enter two commands in the shell:

redis-cli lpush demoCrawler:start_urls url

scrapy crawl demo

Here url is the specific URL I want to crawl, e.g. http://google.com.
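
Equivalently, the start URL can be pushed from Python with the redis-py client; a minimal sketch, assuming Redis runs locally on the default port:

# push_start_url.py -- same effect as the redis-cli lpush above;
# assumes a local Redis on the default port 6379
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)
# push the seed URL onto the list the spider reads via redis_key
r.lpush('demoCrawler:start_urls', 'http://google.com')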

0 Answers
