Question

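I'm taking Udacity's Intro to Computer Science course, and for one of the assignments I have to write code that gets all the links from a web page. Here is the code: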

def get_next_target(page):
    start_link = page.find('<a href=')
    while True:
        if start_link == -1:
            x, y = None, 0
            return x, y
            break
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

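When I run the examples it seems to work, but when I submit my code, I get a result saying that my submission did not terminate. What does that mean? What is wrong with my code?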

Answer 1

In your version, the lines that extract the URL sit outside the while True loop. When a link is found, start_link is not -1, so the if body never runs and nothing inside the loop changes start_link, which means the loop never ends. That is why the submission does not terminate. The loop is not needed inside get_next_target; do the looping in a separate function:

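def get_next_target(page, start=0):
    """Return the next link URL at or after index start, and the index of its closing quote."""
    # Search the full page from start so that all indexes stay absolute;
    # page[start:].find(...) would return an index relative to the slice.
    start_link = page.find('<a href=', start)
    if start_link == -1:
        return None, None
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote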

def find_all(page):
    """Return a list of all link URLs found in page."""
    current_position = 0  # start searching at the beginning of the page
    urls = []
    while current_position < len(page):
        # Get the next URL and move current_position past it, so the
        # next search covers only the rest of the page.
        url, current_position = get_next_target(page, current_position)
        if url is None:  # no more links: stop without appending None
            break
        urls.append(url)
    return urls

However, I would suggest using a regular expression, for example:

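import re

def find_all(page):
    # Non-greedy match, so each URL stops at its own closing quote
    # even when a line contains several links.
    return re.findall(r'<a href="(.+?)"', page)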

Edit: neither of these solutions will detect links such as:

<a href="some/page">, or <a title="ti" href="some/page" >

For that you would need to rework the regular expression. IMHO, this is the best option.
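
As a rough sketch of what such a reworked expression could look like: the pattern below is only an illustration, not part of the original answer. The names LINK_RE, find_all_links and sample are made up for this example. It assumes the href value is quoted, and it allows other attributes before href, extra whitespace around the equals sign, and either quote style.

```python
import re

# Hypothetical pattern (an assumption, not from the original answer):
#   <a\s+            the tag name followed by whitespace
#   (?:[^>]*?\s)?    optionally, other attributes ending in whitespace
#   href\s*=\s*      the attribute name, with optional spaces around "="
#   (["\'])(.*?)\1   the quoted value; \1 matches the same opening quote
LINK_RE = re.compile(
    r'<a\s+(?:[^>]*?\s)?href\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE,
)

def find_all_links(page):
    """Return the href value of every <a> tag matched in page."""
    return [match.group(2) for match in LINK_RE.finditer(page)]

sample = '<a href="one">, <a  href = "two">, <a title="ti" href="three" >'
print(find_all_links(sample))  # prints ['one', 'two', 'three']
```

A regular expression like this still misses unquoted values and can be fooled by unusual markup, so for real pages a proper HTML parser, such as html.parser from the standard library, is likely to be more robust.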
