How do I write a Scrapy spider whose start_urls are the output of a previous spider?

Date: 2018-07-04 11:14:37

Tags: python scrapy sitemap

I wrote a SitemapSpider like this:

from scrapy.spiders import SitemapSpider

class filmnetmapSpider(SitemapSpider):
    name = "filmnetmapSpider"
    sitemap_urls = ['http://filmnet.ir/sitemap.xml']
    sitemap_rules = [
        ('/series/', 'parse_item'),
    ]

    def parse_item(self, response):
        videoid = response.xpath('/loc/text()').extract()

and it extracts all the URLs.

Now I want to write a second Scrapy spider whose start_urls are the output of the previous spider (the SitemapSpider).

How can I do that?

2 answers:

Answer 0 (score: 1)

You need some kind of database or file to store the results of one spider and read them back in the other.

from scrapy import Spider, Request

class FirstSpider(Spider):
    """First spider: crawls something and stores the URLs in a file, one URL per line."""
    name = 'first'
    start_urls = ['someurl']
    storage_file = 'urls.txt'

    def parse(self, response):
        # Append every link found on the page to the shared storage file.
        urls = response.xpath('//a/@href').extract()
        with open(self.storage_file, 'a') as f:
            f.write('\n'.join(urls) + '\n')

class SecondSpider(Spider):
    """Second spider: opens that file and crawls every line in it."""
    name = 'second'

    def start_requests(self):
        with open(FirstSpider.storage_file) as f:
            for line in f:
                if not line.strip():  # skip empty lines
                    continue
                yield Request(line.strip())
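
Since SecondSpider reads the file only when it starts, make sure the first crawl has fully finished before launching the second one. The simplest way is to run them one after another from the command line (`scrapy crawl first`, then `scrapy crawl second`). If you need both in one script, here is a minimal sketch based on Scrapy's documented pattern for running spiders sequentially in the same process (spider classes as defined above):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # Each yield waits until the previous crawl has finished.
    yield runner.crawl(FirstSpider)
    yield runner.crawl(SecondSpider)
    reactor.stop()

crawl()
reactor.run()  # blocks here until both crawls are done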

Answer 1 (score: 1)

Assuming you get the output of the first spider in CSV format, the code below reads that file line by line and scrapes each URL with XPath.

import scrapy

class Stage2Spider(scrapy.Spider):
    name = 'stage2'
    allowed_domains = []
    start_urls = []

    # Build both lists at class-definition time, one URL per line of the file.
    read_urls = open('collecturls.csv', 'r')
    for url in read_urls.readlines():
        url = url.strip()
        # url[4:] assumes each line starts with 'www.' and strips it to get the domain.
        allowed_domains = allowed_domains + [url[4:]]
        start_urls = start_urls + [url]
    read_urls.close()

    def parse(self, response):
        # Scrape whatever you need from each page with XPath here.
        pass
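
Note that this assumes collecturls.csv contains one bare URL per line with no header row. If you produce the file with Scrapy's CSV feed export (e.g. scrapy crawl <first-spider> -o collecturls.csv), the exporter also writes a header line, which you would have to skip when reading.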

Hope that helps.