Website crawling and screenshots

Posted: 2014-03-27 12:10:53

Tags: python scrapy screenshot

I am using Scrapy to crawl a website and store the internal/external links in my item class.

Is there a way to capture a screenshot of each link as it is scraped?

Note: the site has a login (authorization) form.

My code (spider.py):

  from scrapy.spider import BaseSpider
  from scrapy.contrib.spiders.init import InitSpider
  from scrapy.http import Request, FormRequest
  from scrapy.selector import HtmlXPathSelector
  from tutorial.items import DmozItem
  from scrapy.contrib.spiders import CrawlSpider, Rule
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
  import urlparse
  from scrapy import log

  class MySpider(CrawlSpider):

      items = []
      failed_urls = []
      duplicate_responses = []

      name = 'myspiders'
      allowed_domains = ['someurl.com']
      login_page = 'someurl.com/login_form'
      start_urls = ['someurl.com/']

      rules = [Rule(SgmlLinkExtractor(deny=('logged_out', 'logout',)), follow=True, callback='parse_start_url')]

      def start_requests(self):

          yield Request(
              url=self.login_page,
              callback=self.login,
              dont_filter=False
              )


      def login(self, response):
          """Generate a login request."""
          return FormRequest.from_response(response,
            formnumber=1,
            formdata={'username': 'username', 'password': 'password' },
            callback=self.check_login_response)


      def check_login_response(self, response):
          """Check the response returned by a login request to see if we are
          successfully logged in.
          """
          if "Logout" in response.body:
              self.log("Successfully logged in. Let's start crawling! :%s" % response, level=log.INFO)
              self.log("Response Url : %s" % response.url, level=log.INFO)

              yield Request(url=self.start_urls[0])
          else:
              self.log("Bad times :(", loglevel=log.INFO)


      def parse_start_url(self, response):


          # Scrape data from page
          hxs = HtmlXPathSelector(response)

          self.log('response came in from : %s' % (response), level=log.INFO)

          # check for some important page to crawl
          if response.url == 'someurl.com/medical/patient-info' :

              self.log('yes I am here', level=log.INFO)

              urls = hxs.select('//a/@href').extract()
              urls = list(set(urls))


              for url in urls :

                  self.log('URL extracted : %s' % url, level=log.INFO)

                  item = DmozItem()

                  if response.status == 404 or response.status == 500:
                      self.failed_urls.append(response.url)
                      self.log('failed_url : %s' % self.failed_urls, level=log.INFO)
                      item['failed_urls'] = self.failed_urls

                  else :

                      if url.startswith('http') :
                          if url.startswith('someurl.com'):
                              item['internal_link'] = url

                              # Need to capture screenshot of the extracted url here

                              self.log('internal_link :%s' % url, level=log.INFO)
                          else :
                              item['external_link'] = url

                              # Need to capture screenshot of the extracted url here

                              self.log('external_link :%s' % url, level=log.INFO)

                  self.items.append(item)

              self.items = list(set(self.items))
              return self.items
          else :
              self.log('did not receive expected response', level=log.INFO)

Update: I am working on a virtual machine (logged in via PuTTY).

1 answer:

Answer 0 (score: 4):

You could look at a rendering server like Splash.
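
For illustration, here is a minimal sketch of how the spider above could ask a locally running Splash instance for a screenshot by issuing an extra request to Splash's render.png endpoint at the points marked "Need to capture screenshot of the extracted url here". The Splash address (http://localhost:8050), the screenshots/ output directory, and the request_screenshot/save_screenshot method names are assumptions for the example, not part of the original code.

  import os
  import urllib  # Python 2 stdlib, used to URL-encode the target page

  # Request is already imported at the top of spider.py:
  #   from scrapy.http import Request

  # Assumed address of a local Splash instance, e.g. started with:
  #   docker run -p 8050:8050 scrapinghub/splash
  SPLASH_RENDER_PNG = 'http://localhost:8050/render.png?url=%s&width=1024&height=768'

  # The two methods below would be added to the MySpider class.

  def request_screenshot(self, url):
      """Build a request to Splash's render.png endpoint for the given url."""
      return Request(
          url=SPLASH_RENDER_PNG % urllib.quote(url, safe=''),
          callback=self.save_screenshot,
          meta={'original_url': url},
          dont_filter=True,
          )

  def save_screenshot(self, response):
      """Write the PNG body returned by Splash into a screenshots/ directory."""
      if not os.path.isdir('screenshots'):
          os.makedirs('screenshots')
      original_url = response.meta['original_url']
      filename = original_url.replace('://', '_').replace('/', '_') + '.png'
      with open(os.path.join('screenshots', filename), 'wb') as f:
          f.write(response.body)

In parse_start_url, each place that currently only stores the link could then also yield self.request_screenshot(url) (which means emitting items with yield rather than returning one list). Note that because the site requires a login, the spider's session cookies would also need to be made available to Splash for pages behind the login form to render as the logged-in user sees them.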