Question

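I am trying to do some web scraping and wrote a simple script that is meant to print every URL present on a web page. I don't know why it skips many of the URLs and prints the list starting from the middle instead of from the first URL.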

from urllib import request
from bs4 import BeautifulSoup

source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")

for links in soup.select('a'):
    print(links['href'])

Why is that? Can anyone explain what is happening?

I am using Python 3.7.1 on Windows 10 with Visual Studio Code.

Answer 1

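Hrefs often contain only a partial (relative) URL. That is nothing to worry about. Open one of the links in a new browser tab, find the part of the URL that is missing, and prepend it to the href as a string.

In this case, the missing part is "http://www.bda-ieo.it/test/".

Here is your code.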

from urllib import request
from bs4 import BeautifulSoup

source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")

for links in soup.select('a'):
    print('http://www.bda-ieo.it/test/' + links['href'])

This is the result.

http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=A
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=B
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=C
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=D
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=E
...
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=8721_2
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=347_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=2021_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=805958_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=349_1
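
Prepending a fixed string works here because every href on this page is relative to the same directory. As a possible alternative, the standard library's urllib.parse.urljoin can resolve each href against the page URL, which also leaves already-absolute links unchanged. The sketch below is only an illustration: the helper name absolute_links and the sample_hrefs list are hypothetical stand-ins for the values that links.get('href') would return from the real page.

```python
from urllib.parse import urljoin

# The page URL from the question; each href is resolved against it.
PAGE_URL = "http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25"


def absolute_links(page_url, hrefs):
    """Resolve each href against page_url, skipping empty or missing values."""
    return [urljoin(page_url, href) for href in hrefs if href]


# Hypothetical sample hrefs, standing in for what the scraper would collect.
sample_hrefs = [
    "Alphabetical.aspx?Lan=Ita&FL=A",                # relative link
    "ComponentiAlimento.aspx?Lan=Ita&foodid=347_1",  # relative link
    None,                                            # an <a> tag without href
    "http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=B",  # already absolute
]

for url in absolute_links(PAGE_URL, sample_hrefs):
    print(url)
```

In the original script, the list could be built with (links.get('href') for links in soup.select('a')); using get instead of links['href'] avoids a KeyError on anchors that have no href attribute.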
