Question

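I'm writing a web scraper. When it visits a page, it pulls out all the links on that page (subject to some criteria, etc.) and adds them to a queue of pages to visit. I don't want the scraper to visit the same page twice. My current solution is clunky: when a page is visited, I add its URL to a list of visited pages (so it moves from a queue to a list). Then, when I go to visit the next page, I recursively "pop" links off the queue until I get one that isn't in the list of previously visited pages. Like I said, this seems clunky and inefficient, and there must be a better way.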

Here is my code for returning the first unvisited page from the queue:

def first_new_page(queue, visited): 
    ''' 
    Given a queue and list of visited pages, returns the first unvisited URL in the queue 
    '''
    if queue.empty(): 
        return -1 
    rv = queue.get()
    if rv not in visited: 
        return rv 
    else: 
        return first_new_page(queue, visited)

Answer 1

You just need to use set().

Update

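OK, I didn't really give you a solution before, only a hint that you should use set() instead of popping your list. For completeness, here is what you're after: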

visited = set()

queue = ['www.google.com', 'www.yahoo.com', 'www.microsoft.com']

def crawl_the_page(link):
    # ...crawling...
    visited.add(link)
    return


# just loop through the queue list
# no need to pop the list; check membership in the set instead
for url in queue:
    if url not in visited:
        #... do your stuff ...
        #... crawl your pages ...
        crawl_the_page(url)
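
The loop above assumes the queue is fixed. In a real crawler new links keep arriving, so a variation is to check the set when a link is enqueued, so that a URL never enters the queue twice. Here is a minimal sketch of that idea. The names crawl, get_links, max_pages and FAKE_WEB are hypothetical, made up for illustration; get_links stands in for whatever code fetches a page and extracts its links.

```python
from collections import deque


def crawl(start_urls, get_links, max_pages=100):
    """Breadth-first crawl that never enqueues the same URL twice.

    get_links is a hypothetical callable: it takes a URL and returns
    an iterable of URLs found on that page.
    """
    seen = set()     # every URL that has ever been enqueued
    queue = deque()  # URLs waiting to be visited, in FIFO order

    def enqueue(url):
        # Set membership is an average O(1) check, unlike scanning a list.
        if url not in seen:
            seen.add(url)
            queue.append(url)

    for url in start_urls:
        enqueue(url)

    visited_order = []
    while queue and len(visited_order) < max_pages:
        url = queue.popleft()
        visited_order.append(url)  # "visit" the page
        for link in get_links(url):
            enqueue(link)
    return visited_order


# A tiny in-memory link graph standing in for real pages (assumption).
FAKE_WEB = {
    "a": ["b", "c", "a"],
    "b": ["a", "c"],
    "c": ["d", "b"],
    "d": [],
}

print(crawl(["a"], lambda url: FAKE_WEB.get(url, [])))
```

Because duplicates are rejected on the way in, the queue only ever holds unvisited pages and there is nothing to skip when taking the next one.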

Queue data structure - finding the first element in a queue that was not previously queued
