Scrapy run from a Python script only processes the start URL

Date: 2015-06-17 17:01:20

Tags: python python-2.7 scrapy

I have written a Scrapy CrawlSpider:

import tldextract
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy_fastcrawler.items import FastcrawlerItem  # assuming the item is defined in the project's items module

class SiteCrawlerSpider(CrawlSpider):
    name = 'site_crawler'

    def __init__(self, start_url, **kw):
        super(SiteCrawlerSpider, self).__init__(**kw)

        self.rules = (
            Rule(LinkExtractor(allow=()), callback='parse_start_url', follow=True),
        )
        self.start_urls = [start_url]
        self.allowed_domains = tldextract.extract(start_url).registered_domain

    def parse_start_url(self, response):
        external_links = LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response)
        for link in external_links:
            i = FastcrawlerItem()
            i['pageurl'] = response.url
            i['ext_link'] = link.url
            i['ext_domain'] = tldextract.extract(link.url).registered_domain                
            yield i

Now I am trying to run this spider from another Python script, like this:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy_fastcrawler.spiders.site_crawler import SiteCrawlerSpider
from scrapy.utils.project import get_project_settings

spider = SiteCrawlerSpider(start_url='http://www.health.com/')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
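
As an aside, on Scrapy 1.0 and later the usual way to drive a spider from a script is CrawlerProcess, which starts and stops the Twisted reactor for you. A minimal sketch, assuming the same project settings and spider module as above:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy_fastcrawler.spiders.site_crawler import SiteCrawlerSpider

process = CrawlerProcess(get_project_settings())
# keyword arguments are forwarded to the spider's __init__
process.crawl(SiteCrawlerSpider, start_url='http://www.health.com/')
process.start()  # blocks until the crawl finishes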

Problem: everything runs without errors, but the script only processes the start_url and then stops. It does not crawl on to the other links found on the start URL, nor does it do any further processing on them. I have also set up a pipeline, and the items from the start_url are saved correctly through it.

Any help is greatly appreciated.

1 Answer:

Answer 0 (score: 1)

When you override CrawlSpider's default parse_start_url, that method has to give the spider Requests, otherwise it has nowhere to go.

You don't need to implement that method when subclassing CrawlSpider, and from the rest of your code it looks like you really don't want to; try changing the method you defined to parse_page (just don't call it parse).
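
To make that concrete, here is a minimal sketch of the suggested change, with the callback renamed to the hypothetical parse_page in both the Rule and the method. The imports and the FastcrawlerItem path are assumptions based on the question's project layout (the scrapy.contrib paths match the pre-1.0 Scrapy used in the question), and the rules tuple is moved to a class attribute so that CrawlSpider can compile it during its own __init__:

import tldextract
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy_fastcrawler.items import FastcrawlerItem  # assumed items module

class SiteCrawlerSpider(CrawlSpider):
    name = 'site_crawler'

    # The rule's callback now points at the renamed method; follow=True keeps
    # the spider moving on to the links extracted from each page.
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_page', follow=True),
    )

    def __init__(self, start_url, **kw):
        super(SiteCrawlerSpider, self).__init__(**kw)
        self.start_urls = [start_url]
        # allowed_domains is expected to be a list of domain names
        self.allowed_domains = [tldextract.extract(start_url).registered_domain]

    def parse_page(self, response):
        # Same item-building logic as before, under a name that no longer
        # shadows CrawlSpider's parse_start_url (and is not called parse).
        external_links = LinkExtractor(
            allow=(), deny_domains=self.allowed_domains).extract_links(response)
        for link in external_links:
            item = FastcrawlerItem()
            item['pageurl'] = response.url
            item['ext_link'] = link.url
            item['ext_domain'] = tldextract.extract(link.url).registered_domain
            yield item

With the callback wired up this way, the pages reached through the rule should be handled by parse_page and feed the same pipeline as before.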