Scrapy crawled 0 pages, unable to download PDF

Date: 2015-11-18 23:21:02

Tags: python web-scraping scrapy

I am new to Scrapy. I am trying to use Scrapy to download this PDF, and I cannot work out why it isn't working.

import scrapy

class Hawaii_spider(scrapy.Spider):
    name = "hawaii"
    allowed_domains = ["hawaii.edu"]

    def parse_listing(self, response):
        file_urls = ["http://www2.hawaii.edu/~kinzie/documents/CV%20&%20pubs/Kauhako.pdf"]#[start_url + day + end_url for day in days]
        for url in file_urls:
            yield Request(url, callback=self.save_pdf)

    def save_pdf(self, response):
        path = self.get_path(response.url)
        with open(path, "wb") as f:
            f.write(response.body)

1 Answer:

Answer 0 (score: 0)

Based on what you've provided: neither start_urls nor a start_requests() method is defined, so no requests are made and nothing is scraped.
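
For reference, the start_urls route is the shorter fix. Below is a minimal sketch assuming the same target URL; parse is Scrapy's default callback and is not taken from the question code:

import scrapy


class Hawaii_spider(scrapy.Spider):
    name = "hawaii"
    allowed_domains = ["hawaii.edu"]
    # Scrapy automatically issues a request for each entry in start_urls
    start_urls = ["http://www2.hawaii.edu/~kinzie/documents/CV%20&%20pubs/Kauhako.pdf"]

    def parse(self, response):
        # parse() is the default callback for start_urls responses
        with open("test.pdf", "wb") as f:
            f.write(response.body)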

Alternatively, here is a working example using start_requests() that downloads the PDF from here and saves it as test.pdf:

import scrapy


class Hawaii_spider(scrapy.Spider):
    name = "hawaii"
    allowed_domains = ["hawaii.edu"]

    def start_requests(self):
        file_urls = ["http://www2.hawaii.edu/~kinzie/documents/CV%20&%20pubs/Kauhako.pdf"]#[start_url + day + end_url for day in days]
        for url in file_urls:
            yield scrapy.Request(url, callback=self.save_pdf)

    def save_pdf(self, response):
        with open("test.pdf", "wb") as f:
            f.write(response.body)
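
You can run either spider with scrapy runspider hawaii.py (the filename is assumed here). As a side note, Scrapy also ships with a built-in FilesPipeline for downloading files in bulk. A sketch, assuming Scrapy 1.0+; the FILES_STORE directory and the spider's start page are placeholders:

import scrapy

# In settings.py (sketch):
# ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
# FILES_STORE = "downloads"  # placeholder output directory


class HawaiiFilesSpider(scrapy.Spider):
    name = "hawaii_files"
    allowed_domains = ["hawaii.edu"]
    # any page response works as a hook for yielding the item;
    # this start page is a placeholder
    start_urls = ["http://www2.hawaii.edu/~kinzie/"]

    def parse(self, response):
        # FilesPipeline downloads every URL listed under the "file_urls" key
        yield {"file_urls": ["http://www2.hawaii.edu/~kinzie/documents/CV%20&%20pubs/Kauhako.pdf"]}

FilesPipeline names the saved file after a SHA1 hash of its URL, so if you want a fixed name like test.pdf, the start_requests() example above is the simpler choice.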