Question

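If I have a web page's URL, how can I download it locally, including all images, stylesheets, and so on? Do I have to parse the HTML by hand and track down every external resource, or is there a cleaner way?

Thanks!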

Answer 1

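This is one of those times I'd look elsewhere. Not that it can't be done in Ruby, but there are existing tools that already do it very well. Why reinvent the wheel?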

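Take a look at wget. It's the standard tool for retrieving web resources, including mirroring whole sites, and it's available on all platforms. From the docs: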

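Retrieve only one HTML page, but make sure that all the elements needed for the page to be displayed, such as inline images and external style sheets, are also downloaded. Also make sure the downloaded page references the downloaded links.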

wget -p --convert-links http://www.server.com/dir/page.html

The HTML page will be saved to www.server.com/dir/page.html, and the images, stylesheets, etc. somewhere under www.server.com/, depending on where they were on the remote server.
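To make that layout concrete, here is a rough sketch of how a URL maps to a local path. The method name `expected_local_path` is made up for illustration, and it only approximates wget's default behaviour (host directory plus remote path, with index.html for directory URLs); it ignores query strings and options that change the layout.

```ruby
require 'uri'

# Approximate where wget's default layout puts a downloaded URL.
# Hypothetical helper, not part of wget or any library.
def expected_local_path(url)
  uri  = URI.parse(url)
  path = uri.path.to_s
  path = '/' if path.empty?                       # bare host, e.g. http://host
  path += 'index.html' if path.end_with?('/')     # directory URLs get index.html
  File.join(uri.host, path)
end
```

For the command above, `expected_local_path('http://www.server.com/dir/page.html')` gives the same relative path the answer mentions.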

You can easily call wget from a Ruby script using backticks or %x:

`/path/to/wget -p --convert-links http://www.server.com/dir/page.html`

or

%x{/path/to/wget -p --convert-links http://www.server.com/dir/page.html}

There are plenty of other mechanisms in Ruby for doing the same thing that give you more control.
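One such mechanism is Open3 from the standard library. The sketch below is only one possible way to do it: the method names `wget_page_command` and `run_command` and the `wget_bin` parameter are made up for illustration. Passing the command as separate arguments means no shell is involved, so the URL is never interpreted by a shell, and you get stderr and the exit status back.

```ruby
require 'open3'

# Build the argument list for the wget call shown above.
# `wget_bin` is a hypothetical parameter; by default wget is looked up on PATH.
def wget_page_command(url, wget_bin: 'wget')
  [wget_bin, '-p', '--convert-links', url]
end

# Run a command given as an argument array and capture its
# stdout, stderr, and whether it exited successfully.
def run_command(argv)
  stdout, stderr, status = Open3.capture3(*argv)
  { stdout: stdout, stderr: stderr, ok: status.success? }
end

# Usage (requires wget to be installed):
#   result = run_command(wget_page_command('http://www.server.com/dir/page.html'))
#   warn result[:stderr] unless result[:ok]
```

Unlike backticks, this lets the script notice a failed download instead of silently carrying on.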

Answer 2

You can do this fairly easily with Net::HTTP and Nokogiri (though not as easily as just learning to use 'wget'):

require 'nokogiri'
require 'net/http'
require 'pathname'

# Set to the host and the path of the HTML file
host = 'rubygems.org'
path = '/'

# Fetch the page and parse it (plain HTTP; redirects are not followed)
source = Net::HTTP.get( host, path )
page   = Nokogiri::HTML( source )
dir    = Pathname( path ).dirname

# Download images (assumes each src is an absolute path on the same host)
page.xpath( '//img[@src]' ).each do |imgtag|
    localpath = Pathname( imgtag[:src] ).relative_path_from( dir )
    localpath.dirname.mkpath    # create the parent directory, not the file path itself
    localpath.open( 'wb' ) do |fh|
        fh.write( Net::HTTP.get( host, imgtag[:src] ) )
    end
end

# Download stylesheets (same assumption for each href)
page.xpath( '//link[@rel="stylesheet"][@href]' ).each do |linktag|
    localpath = Pathname( linktag[:href] ).relative_path_from( dir )
    localpath.dirname.mkpath
    localpath.open( 'wb' ) do |fh|
        fh.write( Net::HTTP.get( host, linktag[:href] ) )
    end
end

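You'd obviously need better error checking, and the resource-fetching code should be extracted into a method, but if you really want to do this from Ruby, it's certainly possible.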
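As a rough idea of what that extracted method might look like, here is a sketch using only the standard library. The names `http_fetch` and `save_resource` and the `fetch:` parameter are made up for illustration, and skipping resources on other hosts is an assumption of this sketch, not something the code above does.

```ruby
require 'net/http'
require 'uri'
require 'pathname'

# Default fetcher: return the response body, or raise on a non-2xx status.
def http_fetch(uri)
  response = Net::HTTP.get_response(uri)
  raise "GET #{uri} failed with status #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  response.body
end

# Resolve `ref` (an src/href value) against the page URL, download it, and
# write it under `root`, mirroring the remote path. Returns the local Pathname,
# or nil if the resource is on another host or anything goes wrong.
def save_resource(page_url, ref, root, fetch: method(:http_fetch))
  uri = URI.join(page_url, ref)                   # handles relative and absolute refs
  return nil unless uri.host == URI.parse(page_url).host
  local = Pathname(root) + uri.path.sub(%r{\A/}, '')
  local.dirname.mkpath                            # create parent directories only
  local.binwrite(fetch.call(uri))
  local
rescue StandardError => e
  warn "skipping #{ref}: #{e.message}"
  nil
end
```

The image and stylesheet loops would then each shrink to a single call such as `save_resource(page_url, imgtag[:src], 'backup')`, where `page_url` and `'backup'` are whatever page URL and output directory you choose.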

Answer 3

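If you're only doing this for a few pages, I don't think you need a script. You can simply save the web page from any web browser, and it will download the necessary images, stylesheets, and so on. Or, in Chrome, you can browse all the resources used by a single web page.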

How do I back up an entire web page (including images, etc.) in a Ruby script?
