Crawl to the next page and download files of a specific type

Time: 2018-11-13 05:52:24

Tags: python web-scraping scrapy scrapy-spider web-scripting

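I am new to Scrapy and Python. I want Scrapy to go to the next page and download all files of one specific type, EX-10 (for example EX-10.1, EX-10.2). My code runs, but it does not download any files.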
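
To make "all files of type EX-10" concrete, here is a minimal sketch of the filtering step, using only the Python 3 standard library. It assumes that each document row on a filing page can be reduced to a (type, href) pair, which the question does not confirm. The names `is_ex10`, `select_ex10_links`, and `filename_from_url` are hypothetical helpers introduced only for illustration.

```python
import posixpath
import re
from urllib.parse import urljoin, urlparse

# Matches "EX-10" optionally followed by ".<digits>", case-insensitively,
# so "EX-10.1" and "Ex-10.2" match while "EX-101.INS" does not.
EX10_PATTERN = re.compile(r"^EX-10(\.\d+)?$", re.IGNORECASE)


def is_ex10(doc_type):
    """Return True if the document type label is EX-10 or EX-10.<n>."""
    return bool(EX10_PATTERN.match(doc_type.strip()))


def select_ex10_links(rows, base_url):
    """Keep only EX-10 rows and resolve their hrefs against base_url.

    rows is an iterable of (doc_type, href) pairs (an assumed shape).
    """
    return [urljoin(base_url, href) for doc_type, href in rows if is_ex10(doc_type)]


def filename_from_url(url):
    """Use the last path segment as the local filename, ignoring the query."""
    return posixpath.basename(urlparse(url).path)
```

Helpers like these could be called from a Scrapy callback once the type label and the link have been extracted from each row. Note that `urljoin` lives in `urllib.parse` on Python 3; the separate `urlparse` module exists only on Python 2.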

My code

    import urlparse

    from scrapy.http import Request
    from scrapy.spiders import BaseSpider

    class legco(BaseSpider):
        name = "sec_gov"

        allowed_domains = ["www.sec.gov", "search.usa.gov", "secsearch.sec.gov"]
        start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

        #extract search results
        def parse(self, response):
            for link in response.xpath('//div[@id="seriesDiv"]//table[@class="tableFile2"]/a/@href').extract():
                req = Request(url = link, callback = self.parse_page)
                yield req

        #Saving htm files
        def parse(self, response):
            base_url = 'http://www.sec.gov/cgi-bin/browse-edgar'
            for a in response.xpath('//a[@href]/@href'):
                link = a.extract()
                if link.endswith('.htm'):
                    link = urlparse.urljoin(base_url, link)
                    yield Request(link, callback = self.save_pdf)

        def save_pdf(self, response):
            path = response.url.split('/')[-1]
            with open(path, 'wb') as f:
                f.write(response.body)

What mistake did I make in my code? Can anyone help me with this? Thanks.

0 answers:

No answers