I'm new to Scrapy. I'm trying to use Scrapy to download this PDF, and I don't understand why it isn't working.
import scrapy

class Hawaii_spider(scrapy.Spider):
    name = "hawaii"
    allowed_domains = ["hawaii.edu"]

    def parse_listing(self, response):
        file_urls = ["http://www2.hawaii.edu/~kinzie/documents/CV%20&%20pubs/Kauhako.pdf"]  # [start_url + day + end_url for day in days]
        for url in file_urls:
            yield Request(url, callback=self.save_pdf)

    def save_pdf(self, response):
        path = self.get_path(response.url)
        with open(path, "wb") as f:
            f.write(response.body)
Answer (score: 0)
Based on what you've posted: neither start_urls nor a start_requests() method is defined, so no requests are ever made and nothing is scraped.
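The mechanism behind this can be sketched without running Scrapy at all: the framework calls start_requests(), whose default implementation simply yields one request per entry in start_urls. The classes below are a simplified mimic of that behaviour (not Scrapy's actual code) to show why the question's spider produces zero requests:

```python
class Spider:
    # Simplified mimic of scrapy.Spider's default behaviour, for illustration only.
    start_urls = []

    def start_requests(self):
        # Default: yield one request per URL in start_urls.
        for url in self.start_urls:
            yield ("GET", url)

class NoStartSpider(Spider):
    # Like the question's spider: no start_urls and no start_requests override,
    # only a callback that nothing ever triggers.
    def parse_listing(self, response):
        pass

class FixedSpider(Spider):
    # Overriding start_requests makes requests regardless of start_urls.
    def start_requests(self):
        yield ("GET", "http://www2.hawaii.edu/~kinzie/documents/CV%20&%20pubs/Kauhako.pdf")

print(len(list(NoStartSpider().start_requests())))  # 0 -> nothing is crawled
print(len(list(FixedSpider().start_requests())))    # 1
```

In real Scrapy the same logic applies: with neither attribute nor override present, the crawl starts with an empty set of requests and the spider closes immediately.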
Here is a working example that downloads the PDF and saves it as test.pdf:
import scrapy

class Hawaii_spider(scrapy.Spider):
    name = "hawaii"
    allowed_domains = ["hawaii.edu"]

    def start_requests(self):
        file_urls = ["http://www2.hawaii.edu/~kinzie/documents/CV%20&%20pubs/Kauhako.pdf"]  # [start_url + day + end_url for day in days]
        for url in file_urls:
            yield scrapy.Request(url, callback=self.save_pdf)

    def save_pdf(self, response):
        with open("test.pdf", "wb") as f:
            f.write(response.body)
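A standalone spider like this doesn't need a full Scrapy project; assuming the file is saved as hawaii_spider.py (a hypothetical filename), it can be run directly with Scrapy's runspider command:

```shell
# Run the spider from a single file, no project scaffolding required.
# test.pdf is written to the current working directory.
scrapy runspider hawaii_spider.py
```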