Python - Scrapy - Creating a crawler that takes a list of URLs and scrapes them

Date: 2017-12-31 13:29:39

Tags: python scrapy scrapy-spider

I am trying to use the Scrapy package to take a list of URLs and scrape them. I searched Stack Overflow for an answer but could not find anything that solves the problem.

My script is as follows:

import scrapy
from scrapy import Request


class Try(scrapy.Spider):
    name = "Try"

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get("urls")
        print(self.start_urls)

    def start_requests(self):
        print(self.start_urls)
        for url in self.start_urls:
            yield Request(url, self.parse)

    def parse(self, response):
        d = response.xpath("//body").extract()

When I run the spider:

from scrapy.crawler import CrawlerProcess

Spider = Try(urls=[r"https://www.example.com"])
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(Spider)
process.start()

this is what gets printed for self.start_urls:


  • In the __init__ function, printed on screen: [r"https://www.example.com"] (as passed to the spider).
  • In the start_requests function, printed on screen: None

Why do I get None? Is there another way to approach this, or is there a mistake in my spider class?

Thanks for any help!

2 answers:

Answer 0 (score: 1)

I suggest passing the spider class to process.crawl and supplying the urls argument there:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request


class Try(scrapy.Spider):
    name = 'Try'

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get("urls")

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, self.parse)

    def parse(self, response):
        d = response.xpath("//body").extract()


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# Pass the spider class (not an instance) and give urls to crawl() as a keyword argument.
process.crawl(Try, urls=[r'https://www.example.com'])
process.start()
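
One more detail: in the spider above, parse only stores the extracted body in a local variable, so the scraped data never leaves the spider. Below is a minimal sketch of the same idea where parse yields a dict item instead (the url/body field names are just illustrative); Scrapy's feed exports or an item pipeline can then persist whatever parse yields:

import scrapy
from scrapy.crawler import CrawlerProcess


class Try(scrapy.Spider):
    name = 'Try'

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        # start_urls is picked up by the default start_requests()
        self.start_urls = kwargs.get("urls")

    def parse(self, response):
        # Yield a plain dict so feed exports / item pipelines receive the data.
        yield {
            'url': response.url,
            'body': response.xpath("//body").extract_first(),
        }


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(Try, urls=[r'https://www.example.com'])
process.start()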

Answer 1 (score: 0)

If I run

process.crawl(Try, urls=[r"https://www.example.com"])

then the urls are passed to Try as I expect. I don't even need start_requests.

import scrapy
from scrapy.crawler import CrawlerProcess


class Try(scrapy.Spider):
    name = "Try"

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get("urls")

    def parse(self, response):
        print('>>> url:', response.url)
        d = response.xpath("//body").extract()


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(Try, urls=[r"https://www.example.com"])
process.start()

But if I use

spider = Try(urls=["https://www.example.com"])

process.crawl(spider)

then it looks like it runs a new Try without the urls, and the list is empty.
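
As far as I understand Scrapy's behaviour here (at least in the versions current when this was asked), process.crawl() treats its first argument as a spider class or name and constructs the spider itself, forwarding only the keyword arguments given to crawl(); anything stored on a manually created instance never reaches that fresh instance (newer Scrapy releases, if I remember correctly, even reject a spider instance here with an explicit error). A short sketch contrasting the two calls:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# Works: Scrapy builds the Try instance itself and forwards urls=... to __init__,
# so self.start_urls ends up being the list we passed.
process.crawl(Try, urls=["https://www.example.com"])

# Does not work as intended: Scrapy still constructs its own Try instance, and the
# urls kwarg given to this manually created spider never reaches it, so
# kwargs.get("urls") returns None in the instance that actually runs.
# spider = Try(urls=["https://www.example.com"])
# process.crawl(spider)

process.start()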