Question

Why does the following command download index.html from www.example.com?

wget --reject-regex .* http://www.example.com/

$ wget --reject-regex .* http://www.example.com/
--2018-03-05 11:21:26--  http://.keystone_install_lock/
Resolving .keystone_install_lock... failed: nodename nor servname provided, or not known.
wget: unable to resolve host address ‘.keystone_install_lock’
--2018-03-05 11:21:26--  http://www.example.com/
Resolving www.example.com... 93.184.216.34
Connecting to www.example.com|93.184.216.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1270 (1.2K) [text/html]
Saving to: ‘index.html’

index.html                                                    100%[=================================================================================================================================================>]   1.24K  --.-KB/s    in 0s

2018-03-05 11:21:27 (4.49 MB/s) - ‘index.html’ saved [1270/1270]

FINISHED --2018-03-05 11:21:27--
Total wall clock time: 0.4s
Downloaded: 1 files, 1.2K in 0s (4.49 MB/s)

The wget man page says:

--accept-regex urlregex

--reject-regex urlregex

Specify a regular expression to accept or reject the complete URL.

And the regular expression .* matches everything. (You can verify this with freeformatter.com.)
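The same check can be done locally. This is a minimal sketch that assumes grep's extended regex syntax as a stand-in; it only confirms that .* matches the URL string and says nothing about how wget applies the option.

```shell
# Count the input lines matched by the pattern.
# The URL is the one passed to wget in the command above.
matches=$(printf '%s\n' 'http://www.example.com/' | grep -Ec '.*')
printf 'matching lines: %s\n' "$matches"
```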

I expected every download to be rejected because of the wget option --reject-regex .*.

.* matches www.example.com, doesn't it?

Why isn't everything on www.example.com ignored?

Answer 1

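--reject-regex only rejects the URL links found in index.html, not the markup text. For example, if a website contains the URL of a CSS file main.css, this command downloads the site recursively but skips main.css: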

wget -r --reject-regex 'main.css' www.somewebsite.com
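The filtering described here can be pictured without a network connection. The sketch below is only an illustration: it uses grep in place of wget's own matching, and the three links are hypothetical stand-ins for URLs found in a downloaded page.

```shell
# Hypothetical links, standing in for the URLs found in a downloaded page.
links='http://www.somewebsite.com/index.html
http://www.somewebsite.com/main.css
http://www.somewebsite.com/about.html'

# Keep only the links whose full URL does NOT match the reject pattern.
kept=$(printf '%s\n' "$links" | grep -Ev 'main.css')
printf '%s\n' "$kept"
```

The pattern is tested against each complete URL, so the link ending in main.css is dropped and the other two are kept.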

To ignore certain text on the site, use sed. A few examples:

# Removes every occurrence of 'Sans' from the page
wget -qO- example.com | sed "s/Sans//g" > index.html

# Empties every line (index.html keeps one blank line per input line)
wget -qO- example.com | sed "s/.*//g" > index.html

Answer 2

Use the -np option to reject the index file. --reject-regex only applies to recursively retrieved files (any links in the index file).

   -np
   --no-parent
       Do not ever ascend to the parent directory when retrieving recursively.
       This is a useful option, since it guarantees that only the
       files below a certain hierarchy will be downloaded.

Answer 3

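Part of the answer is that the .* in your command was probably expanded by your shell into the list of matching file names in the current working directory, because it was not properly quoted. The .keystone_install_lock in your output is probably the name of a file in your current working directory. wget reports it even before it tries to connect to www.example.com. Try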

wget --reject-regex '.*' http://www.example.com/

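Or possibly use "" instead of '', depending on your shell.

With that command I still retrieve index.html, so my answer is incomplete.

Using -np as suggested by Quantum7 I still get index.html, so that does not complete the answer either.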
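The shell expansion itself can be reproduced without running wget. This is a minimal sketch under stated assumptions: it works in a temporary directory, and the dotfile name is borrowed from the question's output. Depending on the shell, the unquoted expansion may also include . and .. entries.

```shell
# Work in a scratch directory so no real files are touched.
demo_dir=$(mktemp -d) || exit 1
cd "$demo_dir" || exit 1

# Assumed file name, taken from the output shown in the question.
touch .keystone_install_lock

# Unquoted: the shell replaces .* with matching file names
# before the command ever sees it.
unquoted=$(printf '%s\n' .*)

# Quoted: the pattern is passed through literally.
quoted=$(printf '%s\n' '.*')

printf 'unquoted:\n%s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"

# Clean up the scratch directory.
cd / && rm -rf "$demo_dir"
```

The unquoted form prints file names, including .keystone_install_lock, while the quoted form prints the pattern .* unchanged, which is what wget needs to receive.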

wget `--reject-regex` not working?
