Question

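I am building a crawler. I am stuck on a situation where the href values found on one page keep repeating on the other pages under the same domain. For example, if the URL is example.com, the href values on its pages are hrefList = [/hello/world, /aboutus, /blog, /contact].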

The URLs of those pages are therefore example.com/hello/world, example.com/aboutus, and so on.

Now, on the page example.com/hello/world, the same hrefList appears again. As a result I end up with URLs such as example.com/hello/world/hello/world, example.com/hello/world/aboutus, and so on.

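Of these, /hello/world/hello/world is served as a valid page with HTTP status 200, and this repeats recursively. The remaining pages return "page not found" and can therefore be discarded.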

So I keep collecting a list of new URLs that are incorrect. Is there a way to overcome this?

Here is my code:

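import httplib2
from bs4 import BeautifulSoup

# Placeholder values: the snippet assumes these are defined elsewhere.
allUrls = ["http://example.com"]  # crawl queue, extended while iterating
visitedUrls = []                  # URLs that have already been fetched
ignoreUrls = []                   # substrings of hrefs that should be skipped

http = httplib2.Http()
for url in allUrls:
    if url in visitedUrls:
        continue
    visitedUrls.append(url)

    response, content = http.request(url, headers={'User-Agent': 'Crawler-Project'})
    if response.status >= 400:
        continue

    soup = BeautifulSoup(content, 'html.parser')
    for link in soup.find_all('a', href=True):
        href = link['href']
        # Skip empty or one-character hrefs, fragment links and ignored patterns.
        if len(href) <= 1 or href[0] == '#' or any(x in href for x in ignoreUrls):
            continue
        if 'http' in href:
            # Treated as an absolute URL.
            allUrls.append(href)
        elif url[-1] == '/' and href[0] == '/':
            # Both sides have a slash: drop one of them.
            allUrls.append(url + href[1:])
        elif url[-1] != '/' and href[0] != '/':
            # Neither side has a slash: insert one.
            allUrls.append(url + '/' + href)
        else:
            # The href is appended to the URL of the current page.
            allUrls.append(url + href)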

Answer 1

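Assuming the pages you get back are identical, a possible workaround is to compute a hash of each page and make sure you never crawl two pages with the same hash.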

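What you hash determines how robust and how resource-intensive this heuristic is. You could hash the entire page content, or some combination of its content/title and the links your crawler finds on it (or anything else, apart from the URL, that is unique enough per page). Obviously, including the page's URL in the hash is not a good idea, because your problem right now is precisely that these pages have different URLs but the same content (and invalid links).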
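As a rough illustration of that idea, here is a minimal sketch using the standard library's hashlib. The names seen_hashes, page_fingerprint and is_new_page are made up for this example, and hashing body + title + sorted hrefs is only one possible combination; it is an assumption, not something your code already does.

```python
import hashlib

# Hypothetical name: fingerprints of the pages that were already crawled.
seen_hashes = set()

def page_fingerprint(content, title="", hrefs=()):
    """Hash the page body, title and hrefs. The URL is deliberately left out."""
    h = hashlib.sha256()
    h.update(content if isinstance(content, bytes) else content.encode("utf-8"))
    h.update(b"\0" + title.encode("utf-8"))
    # Sort the hrefs so that their order on the page does not change the hash.
    for href in sorted(hrefs):
        h.update(b"\0" + href.encode("utf-8"))
    return h.hexdigest()

def is_new_page(content, title="", hrefs=()):
    """Return True the first time a fingerprint is seen, False afterwards."""
    fingerprint = page_fingerprint(content, title, hrefs)
    if fingerprint in seen_hashes:
        return False
    seen_hashes.add(fingerprint)
    return True
```

In your loop you could call is_new_page(content) right after a successful response and skip the link extraction when it returns False. Hashing the whole body is the strictest variant: any per-request difference in the markup (a timestamp, a token) would make two otherwise identical pages look different, which is why hashing only selected parts may work better.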

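That said, although you can, you should not have to implement workarounds for pages that are not built correctly. It would be a never-ending story.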

How to handle duplicate hrefs in a web crawler?
