Question

The following program runs into an infinite loop. How do I make it stop once I have read all the links? Thanks.

def findAllURLs():

    with open('manylinks.html', 'r') as f:
        data = f.read()
        start = data.find('href')
        while(True):
            begin = data.find('"',start)
            end = data.find('"',begin+1)
            print data[begin+1:end]
            start = data.find('href',end + 1)


if __name__ == "__main__":
    findAllURLs()

Answer 1

You don't need a while loop at all if you use a proper tool to parse the HTML. I recommend the BeautifulSoup 4 library for parsing the document:

import bs4

def find_all_urls():
    with open('manylinks.html', 'r') as f:
        # Name the parser explicitly; otherwise bs4 picks one and warns.
        soup = bs4.BeautifulSoup(f, 'html.parser')

    for i in soup.find_all('a', href=True):
        print(i['href'])

if __name__ == '__main__':
    find_all_urls()

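This finds only <a> elements that have an href attribute, so it skips, for example, <link href=...>. If you also want the link elements, use soup.find_all(href=True) instead.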
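
If you cannot install a third-party package, the same distinction (only <a> tags versus any tag that carries an href) can be sketched with the standard library's html.parser module. This is a swapped-in technique, not BeautifulSoup and not how bs4 works internally; the names HrefCollector and collect_hrefs and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values (hypothetical helper, not part of bs4)."""

    def __init__(self, tags=None):
        super().__init__()
        self.tags = tags      # None means "any tag that has an href"
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if self.tags is not None and tag not in self.tags:
            return
        for name, value in attrs:
            if name == 'href' and value is not None:
                self.hrefs.append(value)


def collect_hrefs(html, tags=None):
    """Return every href in html, optionally limited to the given tags."""
    parser = HrefCollector(tags)
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Inline sample document (an assumption; stands in for manylinks.html).
sample = (
    '<html><head><link rel="stylesheet" href="style.css"></head>'
    '<body><a href="http://example.com/a">A</a>'
    '<a name="anchor-without-href">B</a>'
    '<a href="/relative/b">C</a></body></html>'
)

only_anchors = collect_hrefs(sample, tags={'a'})   # like find_all('a', href=True)
every_href = collect_hrefs(sample)                 # like find_all(href=True)
print(only_anchors)
print(every_href)
```

The <a> element without an href is skipped in both cases, and the stylesheet link only shows up when no tag filter is given.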

Answer 2

You should modify your code like this:

def findAllURLs():

    with open('manylinks.html', 'r') as f:
        data = f.read()
        start = data.find('href')
        while start != -1:  # find() returns -1 once no 'href' is left
            begin = data.find('"',start)
            end = data.find('"',begin+1)
            print(data[begin+1:end])
            start = data.find('href',end + 1)


if __name__ == "__main__":
    findAllURLs()

When find cannot locate another match it returns -1, so the while condition becomes false and the loop ends.
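
A quick way to convince yourself is to run the same loop on a short inline string instead of a file. The function name find_all_hrefs and the sample markup are made up for illustration. As an extra assumption, this sketch also stops when a quote is missing, a case the code above does not handle.

```python
def find_all_hrefs(data):
    """Return the double-quoted values that follow each 'href' in data."""
    urls = []
    start = data.find('href')
    while start != -1:                      # -1 means no further 'href'
        begin = data.find('"', start)
        end = data.find('"', begin + 1)
        if begin == -1 or end == -1:        # unterminated attribute: stop
            break
        urls.append(data[begin + 1:end])
        start = data.find('href', end + 1)
    return urls


# Hypothetical sample input standing in for the contents of manylinks.html.
sample = '<a href="one.html">1</a> <a href="two.html">2</a>'

print('abc'.find('href'))               # str.find returns -1 when nothing matches
print(find_all_hrefs(sample))
print(find_all_hrefs('no links here'))
```

Returning a list instead of printing inside the loop makes the function easy to check on small inputs like these.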
