Question

I am trying to save a copy of the pyparsing project on wikispaces.com before Wikispaces is taken down at the end of the month.

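It seems strange (maybe my Google-fu is broken ^_^), but I can't find any examples of duplicating/copying a website, that is, saving it the way it looks when viewed in a browser. SO has this and this on the topic, but they only save the text of the site, strictly the HTML/DOM structure. Unless I am mistaken, these answers do not appear to save the images, header link files, JavaScript, and related resources needed to render the page. The other examples I have seen are more concerned with extracting parts of a page than with duplicating it as is.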

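I was wondering whether anyone has experience with this sort of thing, or could point me to a useful blog/doc somewhere. I have used WinHTTrack in the past, but robots.txt or the pyparsing.wikispaces.com/auth/ route prevents it from running properly, and I figured I would get some Scrapy experience along the way.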
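To narrow down whether robots.txt is what blocks a given URL, the standard library's `urllib.robotparser` can evaluate rules offline. This is only a minimal sketch: `SAMPLE_ROBOTS_TXT` is a made-up rule set used for illustration, not the actual robots.txt served by wikispaces.com, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT the real file from wikispaces.com
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /auth/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules without any network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://pyparsing.wikispaces.com/home",
                "http://pyparsing.wikispaces.com/auth/login"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, url))
```

Feeding the real robots.txt text into the same helper would show which of the site's routes a well-behaved crawler is expected to skip.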

For those interested in seeing what I have tried so far, here is my crawl spider implementation, which honors the robots.txt file:

from pathlib import Path
from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PyparsingSpider(CrawlSpider):
    name = 'pyparsing'
    allowed_domains = ['pyparsing.wikispaces.com']
    start_urls = ['http://pyparsing.wikispaces.com/']

    # Follow every link inside the allowed domain and save each page
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        page = urlparse(response.url)
        # Mirror the URL path under a folder named after the host;
        # the root URL "/" is stored as "index"
        path = Path(page.netloc) / (page.path.strip("/") or "index")
        path.parent.mkdir(parents=True, exist_ok=True)  # create the folder
        # Append ".html" instead of using with_suffix(), which would replace
        # everything after the last dot in the page name
        path = path.with_name(path.name + ".html")
        path.write_bytes(response.body)

Trying the same thing with a sitemap spider is similar. The first SO link provides an implementation with a plain spider.

from pathlib import Path
from urllib.parse import urlparse

from scrapy.spiders import SitemapSpider


class PyParsingSiteMap(SitemapSpider):
    name = "pyparsing"
    sitemap_urls = [
        'http://pyparsing.wikispaces.com/sitemap.xml',
        # 'http://pyparsing.wikispaces.com/robots.txt',
    ]
    allowed_domains = ['pyparsing.wikispaces.com']
    start_urls = ['http://pyparsing.wikispaces.com']  # "/home"
    custom_settings = {
        "ROBOTSTXT_OBEY": False,  # this spider ignores robots.txt
    }

    def parse(self, response):
        page = urlparse(response.url)
        # Mirror the URL path under a folder named after the host;
        # the root URL "/" is stored as "index"
        path = Path(page.netloc) / (page.path.strip("/") or "index")
        path.parent.mkdir(parents=True, exist_ok=True)  # create the folder
        # Append ".html" instead of using with_suffix(), which would replace
        # everything after the last dot in the page name
        path = path.with_name(path.name + ".html")
        path.write_bytes(response.body)

Neither of these spiders collects anything more than the HTML structure.
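To illustrate what "more than the HTML structure" would involve, here is a rough sketch that lists the image, stylesheet, and script URLs a page refers to, using only the standard library. The names `ASSET_ATTRS`, `AssetCollector`, and `collect_assets` are made up for this example, and the set of tags is an assumption that probably misses some cases (CSS `url(...)` references, `srcset`, and so on).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# (tag, attribute) pairs that usually reference files a page needs to render
ASSET_ATTRS = {("img", "src"), ("script", "src"), ("link", "href")}


class AssetCollector(HTMLParser):
    """Collect absolute URLs of images, scripts and linked files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in ASSET_ATTRS:
                # Resolve relative references against the page URL
                self.assets.append(urljoin(self.base_url, value))


def collect_assets(html, base_url):
    collector = AssetCollector(base_url)
    collector.feed(html)
    return collector.assets
```

Inside a Scrapy callback, each collected URL could then be requested with `scrapy.Request` and written to disk in the same way as the pages, although I have not tested that against this site.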

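In addition, I found that the saved links <a href="...">...</a> do not seem to point to the correct relative paths. At least, when opening the saved files, the links point to a path relative to the hard drive root rather than relative to the file. When opening the pages through http.server, the links point to dead locations; presumably the .html extension is the problem. It may be necessary to remap/replace the links in the stored structure.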
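The remapping idea could look roughly like the sketch below. It assumes each page is stored as `<host>/<path>.html` with the root page saved as `index.html`; `SITE`, `local_path`, and `rewrite_href` are hypothetical names for this example. Query strings and fragments are dropped, which may or may not be acceptable.

```python
import posixpath
from urllib.parse import urljoin, urlparse

SITE = "pyparsing.wikispaces.com"  # host being mirrored


def local_path(url):
    """Map a URL to the file name assumed for its saved copy."""
    parts = urlparse(url)
    page = parts.path.strip("/") or "index"  # root page assumed to be "index"
    return posixpath.join(parts.netloc, page + ".html")


def rewrite_href(href, page_url):
    """Return href rewritten relative to the saved copy of page_url.

    Links that leave the mirrored site are returned unchanged.
    """
    target = urljoin(page_url, href)  # resolve relative and root-relative links
    if urlparse(target).netloc != SITE:
        return href
    start = posixpath.dirname(local_path(page_url))
    return posixpath.relpath(local_path(target), start)


if __name__ == "__main__":
    print(rewrite_href("/home", "http://pyparsing.wikispaces.com/docs/intro"))
```

Each `href` in a saved page would be passed through `rewrite_href` before the file is written, so that the links resolve both from the file system and through http.server.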

Scrapy: preserving a website

0 answers: