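I'm taking Udacity's Intro to Computer Science course, and for one of the assignments I have to write code that gets all the links from a web page. Here is the code: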
def get_next_target(page):
    start_link = page.find('<a href=')
    while True:
        if start_link == -1:
            x, y = None, 0
            return x, y
            break
        start_quote = page.find('"', start_link)
        end_quote = page.find('"', start_quote + 1)
        url = page[start_quote + 1:end_quote]
        return url, end_quote
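When I run the examples it seems to work, but when I submit my code, the result says that my submission did not terminate. What does that mean? What is wrong with my code?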
Answer 0 (score: 0)
def get_next_target(page, start=0):
    """Find the next link in page, searching from index start."""
    # Pass start to find() so the result is an index into the whole page,
    # not into a slice of it.
    start_link = page.find('<a href=', start)
    if start_link == -1:
        return None, None
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    if start_quote == -1 or end_quote == -1:
        # Malformed tag without a quoted value: stop rather than loop.
        return None, None
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def find_all(page):
    """Return a list of all link URLs found in page."""
    current_position = 0  # start at the beginning of the page
    urls = []
    while True:
        # Get the next URL and move current_position to its closing quote,
        # so the next search only looks at the rest of the page.
        url, current_position = get_next_target(page, current_position)
        if url is None:
            return urls
        urls.append(url)
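One detail worth checking when searching from an offset: `page.find(sub, start)` returns an index into the whole string, while `page[start:].find(sub)` returns an index relative to the slice, so the two should not be mixed. A small sketch, where the `page` string and the `start` value are made up for illustration:

```python
# Made-up sample page with two links (an assumption for illustration).
page = 'x <a href="a.html">A</a> y <a href="b.html">B</a>'
start = 20  # arbitrary offset that lies after the first link

# Index into the whole page.
absolute = page.find('<a href=', start)

# Index into the slice page[start:], i.e. relative to start.
relative = page[start:].find('<a href=')

# The two differ by exactly the offset.
print(absolute, relative, relative + start == absolute)
```

If the relative index is later passed to `page.find('"', ...)` as if it were absolute, the search restarts near the beginning of the page and can keep returning the same link.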
But I would suggest using a regular expression, for example:
import re

def find_all(page):
    # [^"]+ stops at the first closing quote after href=".
    return re.findall(r'<a href="([^"]+)"', page)
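With a pattern like this, the choice of quantifier matters when several links share a line: a greedy `.+` runs on to the last quote on the line, while `[^"]+` stops at the first one. A quick sketch using a made-up `line` string:

```python
import re

# Made-up sample line with two links (an assumption for illustration).
line = '<a href="one.html">1</a> <a href="two.html">2</a>'

# Greedy: .+ consumes as much as it can before the final quote on the line.
greedy = re.findall(r'<a href="(.+)"', line)

# Bounded: [^"]+ cannot cross a quote, so each link is captured separately.
bounded = re.findall(r'<a href="([^"]+)"', line)

print(greedy)
print(bounded)
```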
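EDIT: But neither of these solutions will detect links such as:
<a href="some/page">, or <a title="ti" href="some/page" >
For those you would need to rework the regular expression. IMHO that is the best option.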
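A possible reworked pattern, as a rough sketch rather than a complete HTML parser: it tolerates other attributes before `href`, whitespace around `=`, and single or double quotes. The names `LINK_RE` and `find_all_links` and the `sample` string are made up for illustration.

```python
import re

# Hypothetical pattern: "<a", whitespace, any other attributes, then href
# with optional spaces around "=" and a single- or double-quoted value.
LINK_RE = re.compile(r'<a\s+[^>]*?href\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)

def find_all_links(page):
    """Return the href values of all <a> tags matched by LINK_RE."""
    # Group 1 is the opening quote character, group 2 is the URL itself.
    return [match.group(2) for match in LINK_RE.finditer(page)]

# Made-up sample: extra whitespace, an extra attribute, and single quotes.
sample = '<a  href="some/page">x</a> <a title="ti" href=\'other/page\' >y</a>'
print(find_all_links(sample))
```

This still misses unquoted values and would also accept attribute names that merely end in `href`, so for real-world pages a proper HTML parser is likely the safer route.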