Question

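I am trying to download all HTML files from https://www.workandincome.govt.nz/map/ to disk. By that I mean I need index.html and every other HTML file that sits below the URL ending in "map", i.e. https://www.workandincome.govt.nz/map/. For example, I need to download: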

https://www.workandincome.govt.nz/map/income-support/extra-help/disability-allowance/medical-fees-01.html
https://www.workandincome.govt.nz/map/income-support/extra-help/community-costs/index.html

and so on. I do not need any other HTML pages from the same website, only the ones that have map in the URL. I tried the wget command below:

wget --limit-rate=200k --recursive --html-extension --convert-links --random-wait --follow-tags=a -U "Mozilla/5.0 (X11; Linux x86_64)" https://www.workandincome.govt.nz/map/index.html

With the command above I get https://www.workandincome.govt.nz/map/index.html, then http://www.workandincome.govt.nz/robots.txt, and then HTML files I do not need:

www.workandincome.govt.nz/online-services/index.html, www.workandincome.govt.nz/eligibility/index.html

Can someone look at the wget command I am using and suggest a fix? Thanks.

Answer 1

You need to use the -A parameter:

wget -A "*map*" --limit-rate=200k --recursive --html-extension --convert-links --random-wait --follow-tags=a -U "Mozilla/5.0 (X11; Linux x86_64)" https://www.workandincome.govt.nz/map/index.html
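If the -A filter alone does not narrow the crawl enough, restricting by directory may be worth a try. As far as I can tell from the wget manual, -A patterns are compared against the file name portion of a URL rather than its directory path, so a page such as /map/.../index.html might not match "*map*". The sketch below is not from the original answer: it assumes that --no-parent together with --include-directories=/map is enough to keep wget inside the /map/ tree, and it uses --adjust-extension, the newer name for --html-extension. It only prints the command, so you can review it before running anything.

```shell
# Sketch under the assumptions above: keep the recursive crawl inside /map/.
url="https://www.workandincome.govt.nz/map/index.html"

# Collect the options as positional parameters so the quoted
# user-agent string stays a single argument.
set -- --recursive --no-parent --include-directories=/map \
       --adjust-extension --convert-links \
       --limit-rate=200k --random-wait --follow-tags=a \
       -U "Mozilla/5.0 (X11; Linux x86_64)" "$url"

# Dry run: show the command that would be executed.
printf '%s\n' "wget $*"

# Uncomment the next line to start the actual download.
# wget "$@"
```

wget will probably still fetch robots.txt from the site root, since it reads that file to decide what it is allowed to crawl; that request is separate from the directory filter.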
