Scrapy: recursively scrape a website

Date: 2018-05-23 08:48:25

Tags: python scrapy web-crawler scrapy-spider

I want to write a scraper that visits every subpage of the initial page.

The example website is pydro.com, so the spider should, for example, extract pydro.com/impressum and save it as an HTML file on my hard drive.

The code I wrote:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exporters import CsvItemExporter
from scrapy.loader import ItemLoader
from finalproject.items import FinalprojectItem


class ExampleSpider(CrawlSpider):
    name = "projects"  # Spider name
    allowed_domains = ["pydro.com"]  # Which (sub-)domains shall be scraped?
    start_urls = ["https://pydro.com/"]  # Start with this one
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]  # Follow any link scrapy finds (that is allowed).

    def parse_item(self, response):
        print('Got a response from %s.' % response.url)
        self.logger.info('Hi this is an item page! %s', response.url)
        page = response.url.split('.com/')[-1]
        filename = 'pydro.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

When I run my spider, the only output is pydro.html.

I think I need to adjust my filename so that I get the subpages. Or do I need a for loop?
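One way to get a distinct file per page is to derive the filename from response.url instead of hardcoding it. A minimal sketch of that change against the spider above (the slash-to-dash replacement and the 'index' fallback are illustrative choices, not part of the original code):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = "projects"
    allowed_domains = ["pydro.com"]
    start_urls = ["https://pydro.com/"]
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        # Build a per-page name from the URL path, e.g.
        # https://pydro.com/impressum -> pydro-impressum.html
        page = response.url.split('.com/')[-1].strip('/').replace('/', '-') or 'index'
        filename = 'pydro-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

With this, each crawled page is written to its own file instead of overwriting pydro.html on every callback.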

EDIT1: I edited the code so that it fetches all the HTML pages. But when I try to run the script on another website, I get an error:

FileNotFoundError: [Errno 2] No such file or directory: 'otego-https://www.otego.de/de/jobs.php'

This is the script I ran:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exporters import CsvItemExporter
from scrapy.loader import ItemLoader

class ExampleSpider(CrawlSpider):
    name = "otego" #Spider name
    allowed_domains = ["otego.de"] # Which (sub-)domains shall be scraped?
    start_urls = ["https://www.otego.de/en/index.php"] # Start with this one
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)] # Follow any link scrapy finds (that is allowed).

    def parse_item(self, response):
        print('Got a response from %s.' % response.url)
        self.logger.info('Hi this is an item page! %s', response.url)
        page = response.url
        filename = 'otego-%s' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
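The error comes from using the full URL as a filename: open() treats every '/' in 'otego-https://www.otego.de/de/jobs.php' as a directory separator, and those directories do not exist. One possible fix is to sanitize the URL before writing. A small sketch (the helper name url_to_filename and the replacement scheme are hypothetical):

from urllib.parse import urlparse


def url_to_filename(url, prefix='otego'):
    # Drop the scheme and turn the remaining path separators into dashes
    # so the result is a plain filename, e.g.
    # https://www.otego.de/de/jobs.php -> otego-www.otego.de-de-jobs.php.html
    parsed = urlparse(url)
    safe = (parsed.netloc + parsed.path).strip('/').replace('/', '-')
    return '%s-%s.html' % (prefix, safe)

Inside parse_item, filename = url_to_filename(response.url) would then yield a path that open() can create in the current directory.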

1 answer:

Answer 0 (score: 1)

You need to create a recursive crawl. A "subpage" is just another page whose URL is obtained from the "previous page". You have to make a second request to the subpage (its URL should be in the variable sel) and use XPath on the (second) response.
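A minimal sketch of the recursive pattern the answer describes, written without CrawlSpider rules and assuming Scrapy >= 1.4 for response.follow (the spider name, XPath, and file naming here are illustrative, not from the original):

import scrapy


class RecursiveSpider(scrapy.Spider):
    name = "recursive_example"
    allowed_domains = ["pydro.com"]
    start_urls = ["https://pydro.com/"]

    def parse(self, response):
        # Save the current page.
        page = response.url.split('/')[-1] or 'index'
        with open('pydro-%s.html' % page, 'wb') as f:
            f.write(response.body)

        # Extract subpage URLs from this response and issue a second
        # request for each one; Scrapy calls parse() on each new
        # response, which is what makes the crawl recursive.
        for href in response.xpath('//a/@href').extract():
            yield response.follow(href, callback=self.parse)

response.follow resolves relative URLs against the current page, and allowed_domains keeps the recursion from leaving the site.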

How to recursively crawl subpages with Scrapy