I want to write a scraper that visits every subpage of a start page.
The example site is pydro.com, so it should, for instance, fetch pydro.com/impressum and save it as an HTML file on my hard drive.
The code I wrote:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exporters import CsvItemExporter
from scrapy.loader import ItemLoader
from finalproject.items import FinalprojectItem


class ExampleSpider(CrawlSpider):
    name = "projects"  # Spider name
    allowed_domains = ["pydro.com"]  # Which (sub-)domains shall be scraped?
    start_urls = ["https://pydro.com/"]  # Start with this one
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]  # Follow any link scrapy finds (that is allowed).

    def parse_item(self, response):
        print('Got a response from %s.' % response.url)
        self.logger.info('Hi this is an item page! %s', response.url)
        page = response.url.split('.com/')[-1]
        filename = 'pydro.html'  # NOTE: constant name -- every response overwrites the same file
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
When I run my spider, the only output file is pydro.html.
I think I need to adjust the filename so that I get one file per subpage. Or do I need a for loop?
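For reference, a minimal sketch of how parse_item could derive one filename per subpage from response.url (the slug handling here is an assumption, not tested against every URL on the site):

def parse_item(self, response):
    # Build a slug from the URL path, e.g. "impressum"; fall back to "index"
    page = response.url.split('.com/')[-1].strip('/') or 'index'
    # Replace any remaining '/' so the result is a single valid filename
    filename = 'pydro-%s.html' % page.replace('/', '-')
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.log('Saved file %s' % filename)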
EDIT1: I edited the code so that it saves all the HTML pages. But when I run the script against another website, I get an error:
FileNotFoundError: [Errno 2] No such file or directory: 'otego-https://www.otego.de/de/jobs.php'
This is the script I ran:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exporters import CsvItemExporter
from scrapy.loader import ItemLoader


class ExampleSpider(CrawlSpider):
    name = "otego"  # Spider name
    allowed_domains = ["otego.de"]  # Which (sub-)domains shall be scraped?
    start_urls = ["https://www.otego.de/en/index.php"]  # Start with this one
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]  # Follow any link scrapy finds (that is allowed).

    def parse_item(self, response):
        print('Got a response from %s.' % response.url)
        self.logger.info('Hi this is an item page! %s', response.url)
        page = response.url
        filename = 'otego-%s' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
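The error itself comes from the filename: page is the full URL, so filename becomes 'otego-https://www.otego.de/de/jobs.php', and open() treats each '/' in that string as a directory separator pointing at directories that do not exist. One way around this (a sketch; the sanitizing scheme is an assumption, not part of the original code) is to replace unsafe characters before writing:

import re

def parse_item(self, response):
    # Turn the full URL into a single filesystem-safe name:
    # anything that is not alphanumeric, '.' or '-' becomes '_'
    safe = re.sub(r'[^0-9A-Za-z.-]', '_', response.url)
    filename = 'otego-%s.html' % safe
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.log('Saved file %s' % filename)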
Answer (score: 1):
You need to build a recursive crawl. A "subpage" is just another page whose URL you obtained from the "previous" page. You have to issue a second request for the subpage (its URL would be what you extracted, e.g. into a variable sel) and run your XPath on the (second) response.
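In plain scrapy.Spider terms (a CrawlSpider's rules already do this following automatically), the manual version of what this answer describes looks roughly like the sketch below; the XPath expressions and the parse_subpage name are illustrative, not from the original question:

import scrapy

class ManualSpider(scrapy.Spider):
    name = "manual"
    start_urls = ["https://www.otego.de/en/index.php"]

    def parse(self, response):
        # Extract subpage URLs from the current page with XPath ...
        for href in response.xpath('//a/@href').getall():
            # ... and issue a second request for each subpage; its
            # response is handled by parse_subpage below
            yield response.follow(href, callback=self.parse_subpage)

    def parse_subpage(self, response):
        # Run the actual XPath extraction on the (second) response
        title = response.xpath('//title/text()').get()
        yield {'url': response.url, 'title': title}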