I'm trying to build a web scraper that crawls a site. I want to find the shortest path to the page that displays GOAL, while also detecting and counting cycles. Currently, I believe my code can count the nodes (webpages) it finds. Each page contains lines of expressions that evaluate to numbers; each number is appended to the end of a template URL, which leads to another page. This continues until a page shows DEADEND or GOAL.
My url_queue currently works like this: I know the initial link, so I start there. I open the URL, scan/evaluate its contents, and append the new URLs to url_list, e.g.:
url_a, url_b, url_c
My url_queue function then goes to url_a, processes the new URLs it finds, and extends them onto the end of url_list, e.g.:
url_b, url_c, url_a1, url_a2, url_a3...
If a URL has already been visited, it is not added to url_list.
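The mechanics above amount to a FIFO queue: take the URL at the front, append its newly found URLs at the back. A minimal sketch with the placeholder URLs from the example (using collections.deque instead of a plain list; the URL names are just the stand-ins from above):

```python
from collections import deque

queue = deque(["url_a", "url_b", "url_c"])     # placeholder URLs from the example above
url = queue.popleft()                          # url_a is processed first
queue.extend(["url_a1", "url_a2", "url_a3"])   # its newly found URLs go to the back
# queue now holds: url_b, url_c, url_a1, url_a2, url_a3
```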
Currently, my function outputs the number of nodes (webpages) found and the order in which each URL is processed and extended. I'm not sure how to find the shortest path to the GOAL page, nor how to count direct cycles.
My code:
import urllib2  # Python 2; on Python 3 this would be urllib.request

def convert_to_link(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    output_expressions = response.read().splitlines()  # one expression per list item
    num_list = []
    url_list = []
    for expression in output_expressions:
        if expression == 'DEADEND':
            _, _, url_num = url.rpartition('/')
            #print 'DEADEND AT ' + str(url_num)
            continue
        elif expression == 'GOAL':
            _, _, url_num = url.rpartition('/')
            #print 'GOAL AT ' + str(url_num)
            continue
        else:
            num_list.append(evaluate(parse(expression)))  # parse and evaluate the expression
    for number in num_list:
        url_list.append(newpage_gen(number))  # newpage_gen builds the next URL
    return url_list
def url_queue(url_list):
    count = 0
    visited_count = 0
    visited = []
    path_list = []
    for url in url_list:
        _, _, number = url.rpartition('/')
        if number in visited:
            visited_count += 1  # count number of revisited nodes
            print visited_count
            path_list.append(number + "PATH")  # which node is revisited
            continue
        else:
            visited.append(number)  # record which numbers have been visited
            path_list.append(number)  # path list?
            print visited
        new_urls = convert_to_link(url)  # visit the URL from the list
        url_list.extend(new_urls)  # extend onto the end of url_list to visit later
        count += 1  # +1 for each newly visited node
        print count
    return count, visited_count
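Since url_queue already expands URLs in FIFO order, this is essentially a breadth-first search: recording each node's parent when it is first discovered would let the shortest path be reconstructed once GOAL is reached, and counting edges that point back to an already-seen node would count the revisits/cycles. A minimal sketch of that idea on a toy graph, where get_children and is_goal are hypothetical stand-ins for convert_to_link and the GOAL check (not the actual scraper):

```python
from collections import deque

def bfs_shortest_path(start, get_children, is_goal):
    """BFS: nodes are expanded level by level (FIFO), so the first
    time a goal node is dequeued, its path from start is shortest."""
    parent = {start: None}   # parent map doubles as the visited set
    revisit_count = 0        # edges leading back to an already-seen node
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if is_goal(node):
            # walk the parent pointers back to start to recover the path
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], revisit_count
        for child in get_children(node):
            if child in parent:
                revisit_count += 1  # a cycle (or cross) edge
            else:
                parent[child] = node
                queue.append(child)
    return None, revisit_count  # GOAL unreachable

# toy graph standing in for the website: node -> child nodes
graph = {1: [2, 3], 2: [4, 1], 3: [4], 4: [5], 5: []}
path, revisits = bfs_shortest_path(1, lambda n: graph[n], lambda n: n == 5)
# path is [1, 2, 4, 5]; revisits is 2 (2 -> 1 and 3 -> 4)
```

In the real scraper, get_children would call convert_to_link on the node's URL, and is_goal would check whether the page contains the GOAL marker (which convert_to_link would need to report back instead of silently skipping).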