Question

Saving the HTML of a web page with Ruby is easy.

One way is to use rio:

require 'rubygems'
require 'rio'
rio('http://www.google.com') > rio('google.html')

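Is it possible to do the same for the whole page by parsing the HTML, requesting the images, JavaScript, and CSS it references, and then saving each one?

I don't think that would be very efficient.

So, is there a way to save a web page along with all the images, CSS, and JavaScript related to that page?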

Answer 1

How about system("wget -r -l 1 http://google.com")?

Answer 2

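Most of the time we can just use the system's tools. As dimus said, you can use wget to download the page.

There are also many useful APIs for networking tasks, for example net/ftp, net/http, or net/https. See the documentation (Net/HTTP) for details. However, these libraries only give you the response; beyond that, we still have to parse the HTML document ourselves. Using Mozilla's library is another good approach.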
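
To make the "get the response, then parse the HTML" step concrete, here is a minimal sketch using only the standard library. The method names `asset_urls` and `save_page` are made up for this example, and the regex scan is only a rough stand-in for a real HTML parser (a gem such as Nokogiri would be more robust). It also assumes asset file names do not collide, does not follow redirects, and does not rewrite links inside the saved HTML.

```ruby
require 'net/http'
require 'uri'
require 'fileutils'

# Hypothetical helper: collect absolute URLs for the src of every <img> and
# <script> tag and the href of every <link> tag (stylesheets, icons, ...).
# Assumption: attribute values are quoted; unquoted values are missed.
def asset_urls(html, base_url)
  refs = html.scan(/<(?:img|script)\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/i).flatten +
         html.scan(/<link\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/i).flatten
  # Resolve relative references against the page URL and drop duplicates.
  refs.map { |ref| URI.join(base_url, ref).to_s }.uniq
end

# Hypothetical helper: save the page as index.html plus each asset into dir.
def save_page(page_url, dir)
  FileUtils.mkdir_p(dir)
  html = Net::HTTP.get(URI.parse(page_url))
  File.write(File.join(dir, 'index.html'), html)

  asset_urls(html, page_url).each do |asset|
    uri = URI.parse(asset)
    next unless uri.is_a?(URI::HTTP)           # skip data:, mailto:, etc.
    name = File.basename(uri.path)
    next if name.empty? || name == '/'         # nothing usable as a file name
    File.binwrite(File.join(dir, name), Net::HTTP.get(uri))
  end
end

# Example call (performs real network requests):
#   save_page('http://www.google.com/', '/tmp/google')
```

This is essentially what the question describes doing by hand, which is why calling wget is usually less work.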

Answer 3

url = "docs.zillabyte.com"
output_dir = "/tmp/crawl"

# -E = adjust extensions (e.g. save an HTML page at /some_page as /some_page.html)
# -H = span hosts (e.g. include assets from other domains)
# -p = download all assets (page requisites) associated with the page
# -P = output prefix (a.k.a. the directory to dump the assets)
# Passing the arguments separately avoids shell quoting problems.
system("wget", "-E", "-H", "-p", url, "-P", output_dir)

# read files from 'output_dir'
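
The "read files" step is left open above. One way to do it, as a rough sketch: list everything wget wrote under the output directory. The method name `downloaded_files` is invented for this example, and the exact directory layout (wget normally creates one subdirectory per host when -p and -H are used) is an assumption worth checking on your own run.

```ruby
# Hypothetical helper: return every regular file under dir, as sorted paths
# relative to dir. Dotfiles are not matched by this glob pattern.
def downloaded_files(dir)
  Dir.glob('**/*', base: dir)
     .select { |rel| File.file?(File.join(dir, rel)) }
     .sort
end

# Example call, after the wget command has finished:
#   downloaded_files('/tmp/crawl').each { |path| puts path }
```

Each returned path can then be joined with the output directory and opened with File.read or File.binread as needed.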
