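This is very similar to this post: Scrapy crawl from script always blocks script execution after scraping. I can't get anything to run after the reactor.run() line. I have read almost every SO post on the topic and, as you can see from the commented-out code, I have tried several approaches, including what the documentation recommends. Is there something I'm missing? Could the parse_items method be the problem? This is driving me crazy!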
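# Imports assume the Scrapy 0.2x-era module layout this code targets (Python 2).
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.crawler import Crawler
from scrapy.exceptions import CloseSpider
from scrapy.selector import Selector
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
# Hypothetical module path: import the item class from your own project's items module.
from email_scraper.items import EmailScraperItems


class EmailSpider(CrawlSpider):
    name = "email_scraper"
    allowed_domains = ["somedomain.com"]
    start_urls = ["http://www.somedomain.com"]
    # Follow every link and hand each response to parse_items.
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_items')]

    def parse_items(self, response):
        sel = Selector(response)
        results = []
        item = EmailScraperItems()
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        item['email'] = sel.re(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b")
        if item['email']:
            print item['email']
            print item['url']
            if any('info' in email for email in item['email']):
                # Stop the whole crawl as soon as an "info" address turns up.
                results.append(item)
                raise CloseSpider('info email found')
            else:
                results.append(item)
                print results


def stop_reactor():
    reactor.stop()


# Stop the Twisted reactor once the spider has closed.
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = EmailSpider(domain='knechtproperties.com')
#settings = get_project_settings()
crawler = Crawler(Settings())
#crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # blocks here until reactor.stop() is called
print "this will not print"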
Answer 0 (score: 0)
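Found the answer in this thread: Scrapy run from script not working. Apparently log.start() masks the print output, so the script was in fact continuing past reactor.run(); its output just never reached the console. I still need to look up the details of how that works, but for now commenting out log.start() solves the problem.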
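For anyone wondering how a logging call can swallow print output: the sketch below is not Scrapy's code, just a standard-library illustration (written for Python 3) of the general mechanism I assume is at work, where sys.stdout is replaced by an object that forwards writes to a logger. The names StreamToLogger and run_with_stdout_logged are made up for this example. As far as I can tell, the old Scrapy docs describe a logstdout argument to log.start() that controls this kind of redirection, but check the documentation for your version before relying on that.

```python
import contextlib
import io
import logging


class StreamToLogger(io.TextIOBase):
    """File-like object that forwards every written line to a logger."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level

    def write(self, text):
        # print() sends the message and the trailing newline separately;
        # the bare newline produces no log record.
        for line in text.rstrip().splitlines():
            self.logger.log(self.level, line)
        return len(text)


def run_with_stdout_logged(func):
    """Call func() while stdout is redirected into an in-memory log."""
    log_buffer = io.StringIO()
    logger = logging.getLogger("stdout_capture_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the records away from the root logger
    handler = logging.StreamHandler(log_buffer)
    logger.addHandler(handler)
    try:
        with contextlib.redirect_stdout(StreamToLogger(logger)):
            func()
    finally:
        logger.removeHandler(handler)
    return log_buffer.getvalue()


# The print call runs, but its text lands in the log instead of the console.
captured = run_with_stdout_logged(lambda: print("this will not print"))
print("log contents:", repr(captured))
```

The point is that the print statement still executes; its text just goes wherever the replacement stdout sends it. That matches the symptom in the question, where the line after reactor.run() looked like it never ran.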