Why does wget follow rel="nofollow" in anchor tags?

Time: 2016-06-24 13:11:03

Tags: linux web-crawler wget nofollow

I am trying to download the HTML of an entire domain, but wget also follows and downloads the link below, even though the anchor declares rel="nofollow".

<a href="?s=" rel="nofollow" data-avia-search-tooltip="..." aria-hidden="true" data-av_icon="" data-av_iconfont="entypo-fontello" style="height: 88px; line-height: 88px;"><span class="avia_hidden_link_text">Search</span></a>

My wget command is as follows:

wget --no-cookies --ignore-tags=link -e robots=on --span-hosts --output-file=/home/markus/python/test/log.txt http://www.kilnbridge.com --domains kilnbridge.com -x -P /home/markus/python/test -r -E --html-extension -R gif,jpg,pdf,png,rss,php,zip,rar,z7,css,js,eot,svg,ttf,woff,exe --ignore-length --max-redirect=100 --quota=10000k --wait=0.1 --no-check-certificate --remote-encoding=encoding

I have tried various option combinations with both wget 1.15 and 1.18, without success.
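A possible workaround, offered only as a sketch: as far as the wget manual describes it, `-e robots=on` covers robots.txt and the page-level robots meta tag, and it does not mention per-link `rel` attributes, so filtering by URL may be the more reliable route. The snippet below assumes the unwanted links are exactly those carrying the `s=` search parameter. The names `REJECT_RE`, `would_skip` and `crawl` are hypothetical, and `grep -E` is used only as a rough stand-in to preview what wget's `--reject-regex` (POSIX regex by default) would drop.

```shell
#!/bin/sh
# Assumption: the links to skip are the ones with the search parameter "s=".
REJECT_RE='[?&]s='

# Preview whether a URL would be dropped by the pattern.
# grep -E approximates wget's default POSIX regex matching.
would_skip() {
    if printf '%s\n' "$1" | grep -Eq "$REJECT_RE"; then
        echo skip
    else
        echo keep
    fi
}

# Trimmed-down version of the crawl with the extra URL filter.
# Defined but not invoked here; call it as: crawl /home/markus/python/test
crawl() {
    wget -r -E -x -e robots=on \
         --domains kilnbridge.com \
         --reject-regex "$REJECT_RE" \
         -P "$1" http://www.kilnbridge.com
}

would_skip 'http://www.kilnbridge.com/?s='      # prints "skip"
would_skip 'http://www.kilnbridge.com/about/'   # prints "keep"
```

The pattern is matched against the complete URL, so it is worth previewing a few real links with `would_skip` before starting a long crawl.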

Output of wget --version:

GNU Wget 1.18 built on linux-gnu.

-cares +digest -gpgme +https +ipv6 +iri +large-file -metalink +nls +ntlm +opie -psl +ssl/openssl

Wgetrc:
/opt/wget/etc/wgetrc (system)
Locale:
/opt/wget/share/locale
Compile:
gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/opt/wget/etc/wgetrc"
-DLOCALEDIR="/opt/wget/share/locale" -I. -I../lib -I../lib
-DHAVE_LIBSSL -DNDEBUG
Link:
gcc -DHAVE_LIBSSL -DNDEBUG -luuid -lssl -lcrypto -lz -lidn
ftp-opie.o openssl.o http-ntlm.o ../lib/libgnu.a

Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://www.gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Originally written by Hrvoje Niksic <hniksic@xemacs.org>.
Please send bug reports and questions to <bug-wget@gnu.org>.

0 answers:

No answers yet.