Finding loops and the shortest path in a web crawler

Time: 2013-12-02 01:53:17

Tags: python

I'm trying to build a web crawler that scans a site. I want to find the shortest path to the page that displays GOAL, while also finding and counting loops. At the moment my code can count the number of nodes (web pages) it finds. Each page contains lines of expressions that evaluate to numbers. Each number is appended to the end of a template URL, which leads to another page, and this continues until a page reads DEADEND or GOAL.
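For context, a minimal sketch of what I mean by appending a number to a template URL (BASE_URL and this version of newpage_gen are simplified stand-ins for my real helpers):

# Simplified stand-in for my real newpage_gen helper.
BASE_URL = 'http://example.com/page/'   # hypothetical template URL

def newpage_gen(number):
    # Append the evaluated number to the template to get the next page
    return BASE_URL + str(number)

# newpage_gen(42) -> 'http://example.com/page/42'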

My url_queue currently works like this: I know the initial link and start from there. I open the URL, scan and evaluate its contents, and append the new URLs to url_list.

e.g.:

url_a, url_b, url_c

My url_queue function visits url_a, processes it for new URLs, and extends url_list with them:

e.g.:

url_b, url_c, url_a1, url_a2, url_a3...

If a URL has already been visited, it is not added to url_list.
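In other words, url_list behaves like a FIFO queue, which gives a breadth-first traversal. A minimal sketch of that pattern, where get_links is a hypothetical stand-in for fetching a page and evaluating its expressions:

# Sketch of the FIFO / breadth-first pattern described above.
def bfs_order(start_url, get_links):
    queue = [start_url]             # process URLs in discovery order
    visited = set([start_url])
    order = []
    while queue:
        url = queue.pop(0)          # take from the front (FIFO)
        order.append(url)
        for link in get_links(url):
            if link not in visited: # skip URLs that were already seen
                visited.add(link)
                queue.append(link)  # extend at the back, visit later
    return order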

Currently, my function outputs the number of nodes (pages) found and the order in which each URL is processed and expanded. I'm not sure how to find the shortest path to the GOAL page, or how to count the number of direct loops.

My code:

import urllib2

def convert_to_link(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    output_expressions = response.read().splitlines()   # one expression per list entry
    num_list = []
    url_list = []
    for expression in output_expressions:
        if expression == 'DEADEND':
            _, _, url_num = url.rpartition('/')
            #print 'DEADEND AT ' + str(url_num)
            continue
        elif expression == 'GOAL':
            _, _, url_num = url.rpartition('/')
            #print 'GOAL AT ' + str(url_num)
            continue
        else:
            num_list.append(evaluate(parse(expression)))    # parse and evaluate compute each expression
    for number in num_list:
        url_list.append(newpage_gen(number))    # newpage_gen builds the URL for the next page
    return url_list

def url_queue(url_list):
    count = 0
    visited_count = 0
    visited = []
    path_list = []
    for url in url_list:
        _, _, number = url.rpartition('/')
        if number in visited:
            visited_count += 1                  # count revisited nodes
            print visited_count
            path_list.append(number + "PATH")   # mark which node was revisited
            continue
        else:
            visited.append(number)              # record this page number as visited
            path_list.append(number)            # order in which pages were processed
            print visited
            new_urls = convert_to_link(url)     # fetch and evaluate this URL
            url_list.extend(new_urls)           # append new URLs to visit later
            count += 1                          # +1 for each newly visited node
            print count
    return count, visited_count
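One idea I had (a minimal sketch, not my working code): because the crawl is breadth-first, the first time GOAL is reached is already along a shortest path in hops, so recording a parent pointer when each URL is first discovered lets me walk the path back, and every edge that leads into an already-visited node is one loop. Here is_goal is a hypothetical check for a page whose body reads GOAL, and convert_to_link is the function above:

# Sketch: BFS with parent pointers recovers a shortest path to GOAL,
# and edges into already-visited nodes are counted as loops.
def shortest_path_to_goal(start_url):
    parent = {start_url: None}      # which URL first discovered each URL
    queue = [start_url]
    loop_count = 0
    while queue:
        url = queue.pop(0)
        if is_goal(url):            # hypothetical: does this page show GOAL?
            path = []
            while url is not None:  # walk parent pointers back to the start
                path.append(url)
                url = parent[url]
            path.reverse()
            return path, loop_count
        for link in convert_to_link(url):
            if link in parent:
                loop_count += 1     # edge back into a seen node: one loop
            else:
                parent[link] = url  # first discovery sets the BFS tree edge
                queue.append(link)
    return None, loop_count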

0 Answers:

No answers