Question

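I want to crawl recursively, but only while a link stays on the same domain (parent_domain in the code below) or one of its subdomains. If a link points to a different domain, I want to visit it only once and not recurse into it. For example, if we find somedomain2.com while crawling somedomain.com, we want to print somedomain2.com but not go any deeper into it. I am using beautifulsoup4 for the crawling.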
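The "same domain or subdomain" rule described above could be checked with a small helper like the one below. This is only a sketch: the function name is_same_site is hypothetical (it is not part of the original script), and it assumes parent_domain is given without a "www." prefix, since a subdomain is detected by a simple suffix match on the hostname.

```python
from urllib.parse import urlparse


def is_same_site(url, parent_domain):
    """Return True if url's host equals parent_domain's host or is a subdomain of it.

    Hypothetical helper: both arguments are expected to be absolute URLs.
    """
    host = (urlparse(url).hostname or "").lower()
    parent = (urlparse(parent_domain).hostname or "").lower()
    if not host or not parent:
        return False
    # The leading dot stops "notsomedomain.com" from matching "somedomain.com".
    return host == parent or host.endswith("." + parent)


parent_domain = "https://somedomain.com/"
print(is_same_site("https://somedomain.com/about", parent_domain))      # True
print(is_same_site("https://blog.somedomain.com/post", parent_domain))  # True
print(is_same_site("https://somedomain2.com/", parent_domain))          # False
```

A link for which this returns False would be printed once and then left alone, rather than being crawled further.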

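As I understand it, we have to add the links to a queue, take them out one by one, and crawl each of them recursively. But if a link's page contains a bunch of links of its own, there would be another queue inside that link, and this is getting too complicated for me to code. Also, how would we know how many nested queues are needed, since each page may again contain many links?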

This is how I picture the queues:

Link1
  Link1-Link1
     Link1-Link1-Link1
         <like this it can also have bunch of other links>
     Link1-Link1-Link2
     Link1-Link1-Link3
  Link1-Link2
  Link1-Link3
Link2
Link3
Link4
Link5
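A tree like the one above does not necessarily need nested queues. One common approach is a single FIFO queue plus a set of URLs already seen: every newly found link is appended to the same queue, so the depth of the tree never has to be known in advance. The sketch below illustrates this with a toy in-memory "site" instead of real HTTP requests. The names crawl, get_links, should_follow and the pages dictionary are all hypothetical; the startswith check is only a stand-in for a proper same-domain-or-subdomain test.

```python
from collections import deque


def crawl(start_url, get_links, should_follow):
    """Breadth-first crawl driven by one queue.

    get_links(url)     -> iterable of absolute URLs found on that page.
    should_follow(url) -> True if the links on that URL should be crawled too.
    Both callables are hypothetical helpers supplied by the caller.
    Returns (crawled, external): pages whose links were followed, and
    URLs that were recorded once but not crawled any further.
    """
    queue = deque([start_url])
    seen = {start_url}          # prevents loops and duplicate visits
    crawled, external = [], []

    while queue:
        url = queue.popleft()
        crawled.append(url)
        for link in get_links(url):
            if link in seen:
                continue
            seen.add(link)
            if should_follow(link):
                queue.append(link)     # same site: crawl it later
            else:
                external.append(link)  # other domain: record it, do not recurse
    return crawled, external


# Toy link graph standing in for real pages (assumption, for illustration only).
pages = {
    "https://somedomain.com/": ["https://somedomain.com/a", "https://somedomain2.com/"],
    "https://somedomain.com/a": ["https://somedomain.com/", "https://somedomain.com/b"],
    "https://somedomain.com/b": ["https://somedomain2.com/"],
}

crawled, external = crawl(
    "https://somedomain.com/",
    get_links=lambda url: pages.get(url, []),
    should_follow=lambda url: url.startswith("https://somedomain.com/"),
)
print(crawled)   # the three somedomain.com pages, in breadth-first order
print(external)  # ['https://somedomain2.com/']
```

In a real script, get_links would wrap the requests and BeautifulSoup calls and turn each href into an absolute URL (for example with urllib.parse.urljoin) before returning it.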

The code I am currently using is:

# modules in my script
.
.
. # Code for something else
.
.

#Now comes the crawler code.

parent_domain = "https://somedomain.com/"

response = requests.get(parent_domain) # Consider requests module imported

code = response.text

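soup = BeautifulSoup(code, "html.parser") # Consider bs module imported; naming the parser avoids the "no parser was explicitly specified" warning

for url in soup.find_all('a'): # this looks for every anchor <a> tag
    print(url.get('href'))   # this gives us the value of the href attribute from the <a> tag (None if the tag has no href)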

I have commented the code as much as I could so that everyone can follow it.

How can I crawl these links recursively?

0 answers: