Question

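I need to download all the web pages that use images from a specific website folder (/content/). Trying to access that folder directly gives a 403 error, but all the page links are in the index. They all follow the same pattern, "content.php?id=xx", where 'xx' is any number with two to four digits.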

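My idea was to download the whole website and delete everything except the "content" folder, but that would cost a lot of time and bandwidth, since this is a cronjob that has to run many times. The other approach would be to write a bash script, something like: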

wget -k -p http://www.example.com/content/content.php?id{{x}}

Assuming this is a bash script, how can I set a variable for wget so that it downloads all the id pages (perhaps with a for loop)?

Answer 1

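Since there is an index, ideally you would have wget follow the links in the index, filtering for only the URLs you want instead of fetching the whole site. curl can't parse HTML and follow the links in it, but wget can.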

wget has -A / -R accept/reject glob expressions, or --accept-regex / --reject-regex.

wget -p -k --recursive --level=1 -A '*/content.php?id=*'  http://www.example.com/content/index.php

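Adjust the accept pattern as needed so that you avoid fetching the whole site but still get everything you want. The way wget applies accept/reject rules to HTML versus other file types is a bit complicated; see the documentation (scroll down to the bottom of the section on accept/reject patterns).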
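As a rough sketch of the --accept-regex route: the wget manual describes --accept-regex as matching against the complete URL, which may suit this case better than -A, because what distinguishes these pages is the query string. The pattern and the helper name `is_content_url` below are assumptions for illustration. The function only uses bash's own `[[ =~ ]]` (POSIX extended regex) to preview which URLs such a pattern would accept; it does not call wget.

```bash
#!/usr/bin/env bash
# Assumed pattern: "content.php?id=" followed by two to four digits, then end of URL.
content_re='content\.php\?id=[0-9]{2,4}$'

# Hypothetical helper: succeeds if the URL matches the pattern above.
is_content_url() {
    [[ $1 =~ $content_re ]]
}

# Preview the filter on a few sample URLs.
for url in \
    'http://www.example.com/content/content.php?id=42' \
    'http://www.example.com/content/content.php?id=7' \
    'http://www.example.com/about.php'
do
    if is_content_url "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

The same pattern could then be tried as `wget -p -k --recursive --level=1 --accept-regex 'content\.php\?id=[0-9]{2,4}$' http://www.example.com/content/index.php`, assuming wget's default regex type accepts this syntax. It is worth checking how the filter interacts with the page requisites pulled in by -p before relying on it.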

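The easiest way to brute-force it is with curl instead of wget, because curl has range expressions. It will also reuse the same HTTP connection for multiple requests, instead of hammering the server with a new TCP connection for each one. (wget does use HTTP keep-alive by default, but obviously that only helps if you put multiple URLs on one command line, not if you run it separately for each URL.)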

curl -L --remote-name-all --compressed --remote-time --fail 'http://www.example.com/content/content.php?id=[00-9999]'

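Note the single quotes around the URL containing the range expression: curl needs to see it as written, rather than bash treating it as a glob or brace expression.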

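--remote-name-all: save each file under a name based on the remote name, instead of writing to stdout. Older versions of curl needed a -O for every URL pattern on the command line.
-L: follow redirects (--location).
--fail: fail silently on server errors (such as 404), instead of saving the ErrorDocument.
--compressed: allow gzip-compressed transfers (curl decompresses the response automatically).
--remote-time: set the local file timestamp from the remote modification time.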
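On the keep-alive remark: if wget is preferred, one way to let it reuse connections is to hand a single wget process the whole URL list instead of starting it once per id. A minimal sketch, where `gen_urls` is a hypothetical helper and the id range used in the usage note is an assumption based on the question:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one URL per line for the ids FIRST..LAST.
# Usage: gen_urls BASE_URL FIRST LAST
gen_urls() {
    local base=$1 first=$2 last=$3 id
    for ((id = first; id <= last; id++)); do
        printf '%s?id=%d\n' "$base" "$id"
    done
}

# Small sample so the output can be checked before any download is started.
gen_urls 'http://www.example.com/content/content.php' 10 12
```

Piping the full list into one process, for example `gen_urls http://www.example.com/content/content.php 10 9999 | wget -k -p -i -`, makes wget read the URLs from standard input (`-i -`).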

A test run of the curl command looks right:

$ curl -L --remote-name-all --compressed --remote-time --fail 'http://www.example.com/content/content.php?id=[00-9999]'

[1/10000]: http://www.example.com/content/content.php?id=00 --> content.php?id=00
--_curl_--http://www.example.com/content/content.php?id=00
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (22) The requested URL returned error: 404 Not Found

[2/10000]: http://www.example.com/content/content.php?id=01 --> content.php?id=01
--_curl_--http://www.example.com/content/content.php?id=01
curl: (22) The requested URL returned error: 404 Not Found

[3/10000]: http://www.example.com/content/content.php?id=02 --> content.php?id=02

...

[100/10000]: http://www.example.com/content/content.php?id=99 --> content.php?id=99
--_curl_--http://www.example.com/content/content.php?id=99
curl: (22) The requested URL returned error: 404 Not Found

[101/10000]: http://www.example.com/content/content.php?id=100 --> content.php?id=100
--_curl_--http://www.example.com/content/content.php?id=100
curl: (22) The requested URL returned error: 404 Not Found

...
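The transcript shows how the range was expanded: ids below 100 are zero-padded to two digits (00, 01, ..., 99), the rest are not (100, ...), and there are 10000 requests in total. For comparison, here is a sketch that reproduces that id sequence in plain bash, for instance to build the same URL list for another tool. `padded_ids` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the ids FIRST..LAST, zero-padded to at least two digits.
padded_ids() {
    local first=$1 last=$2 i
    for ((i = first; i <= last; i++)); do
        printf '%02d\n' "$i"
    done
}

# Show the 1st, 100th, 101st and last id of the sequence.
# Prints 00, 99, 100 and 9999, one per line.
padded_ids 0 9999 | sed -n '1p;100p;101p;$p'
```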

Answer 2

How about:

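# Two- to four-digit ids run from 10 to 9999.
for id in $(seq 10 9999); do
    # Quote the URL so the shell does not treat '?' as a glob character.
    wget -k -p "http://www.example.com/content/content.php?id=$id"
done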

This assumes that every two- to four-digit ID is in use; otherwise you will get a lot of errors.

With more information, there might be a better solution.
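One way to cope with unused ids is to let the loop carry on and only report the ids that failed. A minimal sketch, assuming that a non-zero exit status from the download command is what should count as a failure; `fetch_ids` is a hypothetical helper and the URL is the one from the question:

```bash
#!/usr/bin/env bash
# fetch_ids FIRST LAST CMD [ARGS...]
# Runs "CMD ARGS URL" for every id in FIRST..LAST. Ids whose command exits
# with a non-zero status are reported on stderr and the loop continues.
fetch_ids() {
    local first=$1 last=$2 id
    shift 2
    for ((id = first; id <= last; id++)); do
        "$@" "http://www.example.com/content/content.php?id=$id" \
            || printf 'failed: id=%s\n' "$id" >&2
    done
}

# Dry run: echo stands in for the real downloader and just prints each URL.
fetch_ids 10 12 echo
```

With wget this might be called as `fetch_ids 10 9999 wget -q -k -p`, which keeps the per-id invocation of the loop above but leaves a short list of failed ids instead of a wall of error output.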

Download all web pages with ids from a specific website folder using wget in bash
