Calling a Scrapy spider from a script - can't stop the reactor?

Asked: 2014-01-05 23:36:49

Tags: python python-2.7 scrapy

Very similar to this question: Scrapy crawl from script always blocks script execution after scraping, I can't get anything after the reactor.run() line to execute. I've read just about every SO post on the topic, and as you can see from the commented-out code, I've tried several approaches, including what the documentation recommends. Is there something I'm missing? Maybe there's a problem with the parse_items method? It's driving me crazy!

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.exceptions import CloseSpider
from scrapy.xlib.pydispatch import dispatcher
#from scrapy.utils.project import get_project_settings

from email_scraper.items import EmailScraperItems  # adjust to your project's items module

class EmailSpider(CrawlSpider):
    name = "email_scraper"
    allowed_domains = ["somedomain.com"]
    start_urls = ["http://www.somedomain.com"]
    # Follow every link on the site and pass each response to parse_items
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_items')]

    def parse_items(self, response):
        sel = Selector(response)
        results = []
        item = EmailScraperItems()
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        item['email'] = sel.re(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b")
        if item['email'] != []:
            print item['email']
            print item['url']
            if any('info' in email for email in item['email']):
                results.append(item)
                raise CloseSpider('info email found')
            else:
                results.append(item)

        # NOTE: results is only printed, never yielded, so items never reach pipelines
        print results

def stop_reactor():
    reactor.stop()

# Stop the Twisted reactor once the spider finishes crawling
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = EmailSpider(domain='knechtproperties.com')
#settings = get_project_settings()
crawler = Crawler(Settings())

#crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # blocks here until reactor.stop() is called

print "this will not print"

1 Answer:

Answer 0 (score: 0)

Found the answer in this thread: Scrapy run from script not working. Apparently log.start() masks the print output. I need to look up more details on how this works, but for now commenting it out solved the problem.
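
For reference, a minimal sketch of the fixed tail of the script with log.start() commented out, as the linked thread suggests. It assumes the same Scrapy 0.x-era API the question uses (Crawler, dispatcher, and the Twisted reactor); the spider class and domain argument are taken from the question:

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher

def stop_reactor():
    reactor.stop()

# Shut the reactor down when the spider closes, so the script continues past reactor.run()
dispatcher.connect(stop_reactor, signal=signals.spider_closed)

crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(EmailSpider(domain='knechtproperties.com'))
crawler.start()
#log.start()  # redirects stdout into the log, which is what was hiding the print output
reactor.run()  # returns once stop_reactor() fires

print "this now prints"

Depending on the Scrapy version, log.start() also takes arguments controlling whether stdout is captured (check scrapy/log.py for your release); keeping logging enabled and switching from print to log.msg() is another option.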