Python Scrapy - Not crawling

Date: 2014-12-03 06:17:14

Tags: python scrapy web-crawler

I am trying to crawl some websites using Scrapy; a sample of the code is below. The parse method is never called. I am trying to run the code through the Twisted reactor (code provided), so I run it from startCrawling.py, which holds the reactor. I know I am missing something. Can you please help?

Thanks,

Code - categorization.py

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from items.items import CategorizationItem
from scrapy.contrib.spiders.crawl import CrawlSpider
class TestingSpider(CrawlSpider):
    print 'in spider'
    name = 'testSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ['http://www.wikipedia.org']

    def parse(self, response):
        # Scrape data from page
        print 'here'
        open('test.html', 'wb').write(response.body)

Code - startCrawling.py

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.utils.project import get_project_settings

from spiders.categorization import TestingSpider

# Scrapy spiders script...

def stop_reactor():
    # Stop the Twisted reactor once the spider has closed.
    reactor.stop()  # @UndefinedVariable
    print 'hi'

# This setup must run at module level, not inside stop_reactor(),
# otherwise the crawler is never started.
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = TestingSpider()
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
reactor.run()  # @UndefinedVariable

1 Answer:

Answer 0 (score: 2)

使用parse()时,您不应该覆盖CrawlSpider方法。您应该在callback中使用其他名称设置自定义Rule 以下是official documentation

的摘录
  

编写爬网蜘蛛规则时,请避免使用parse作为回调   CrawlSpider使用parse方法本身来实现其逻辑。   因此,如果您覆盖解析方法,则爬行蜘蛛将不再存在   工作