Question

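I'm trying to extract all links from a given text using recursion. The problem is that I want to store the links in a list and, for whatever reason, calling append crashes my code.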

def findLink(text, start, *links):
    linkStart = text.find('http', start);
    if linkStart == -1:
        return

    linkEnd = text.find('">', linkStart);
    url = text[linkStart:linkEnd];
    links.append(url);
    findLink(text, linkEnd + 2, links);


source = '''<html xmlns="http://www.w3.org/1999/xhtml">
          <head>
          <title>Udacity</title>
          </head>
          <body>
          <h1>Udacity</h1>
          <p><b>Udacity</b> is a private institution of
          <a href="http://www.wikipedia.org/wiki/Higher_education">higher education founded by</a> <a href="http://www.wikipedia.org/wiki/Sebastian_Thrun">Sebastian Thrun</a>, David Stavens, and Mike Sokolsky with the goal to provide university-level education that is "both high quality and low cost".</p>   
          <p> It is the outgrowth of a free computer science class offered in 2011 through Stanford University. Currently, Udacity is working on its second course on building a search engine. Udacity was announced at the 2012 <a href="http://www.wikipedia.org/wiki/Digital_Life_Design">Digital Life Design</a> conference.</p>      
          </body>
          </html>'''

links = list();
findLink(source, 0, links);

for link in links:
    print(link);

Answer 1

First, two general comments:

You don't need semicolons at the end of lines.
Don't parse HTML with regular expressions. Python has a convenient XML parser in the standard library.
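
As a rough illustration of the second comment, here is a minimal sketch that collects href values with a standard-library parser instead of string searching. It uses html.parser, which is an assumption on my part and not necessarily the parser the answer had in mind; the names LinkCollector and extract_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    # Hypothetical helper: feed the whole document and return the hrefs found
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Unlike a search for 'http', this only picks up anchor hrefs, so a namespace URL in an xmlns attribute would not be reported as a link.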

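Now, about your question. When you write a function whose last parameter is varargs, like f(a, b, *c), Python makes c a tuple. Tuples are immutable, so they have no append() method. So you can either convert it to a list and then use append(), or go (semi-)pure and write links = links + (url,).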
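
A small sketch of what this means in practice; the function names show_varargs, try_append and add_link are made up for the demonstration.

```python
def show_varargs(first, *rest):
    # rest is always a tuple, even when the caller passes a list
    return type(rest).__name__, rest


links = []
kind, packed = show_varargs('text', links)
# The list was not unpacked: it arrived as the single element of a tuple


def try_append(*links):
    # Calling append on the varargs tuple raises AttributeError
    try:
        links.append('http://example.com')
    except AttributeError as exc:
        return str(exc)
    return None


message = try_append()


def add_link(url, *links):
    # The (semi-)pure alternative: build a new tuple instead of mutating
    return links + (url,)


extended = add_link('http://example.com/b', 'http://example.com/a')
```

Note that even if the tuple is converted to a list inside the function, the caller's list is not modified, so the result has to be returned.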

Also, the way you call the function recursively later on is incorrect. You need to write

findLink(text, linkEnd + 2, *links)

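to pass links as varargs (this works for both lists and tuples). That said, there is no reason to pass it this way, because on a large HTML document it would result in a lot of arguments being passed to the function, and I'm not sure how Python would handle that. Just pass it normally, as a list or a tuple.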
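
One way the function could look with links passed as an ordinary list. This is a sketch, not the only fix; the snake_case name find_links, the default arguments and the extra check for a missing '">' are my additions.

```python
def find_links(text, start=0, links=None):
    # Create the list on the first call; reuse the same list while recursing
    if links is None:
        links = []

    link_start = text.find('http', start)
    if link_start == -1:
        return links

    link_end = text.find('">', link_start)
    if link_end == -1:
        # No closing '">' after this URL, so stop rather than slice wrongly
        return links

    links.append(text[link_start:link_end])
    return find_links(text, link_end + 2, links)


sample = '<a href="http://a.example/one">one</a> <a href="http://b.example/two">two</a>'
found = find_links(sample)
```

Because the same list object is handed down on every call, each append is visible to the caller. A plain loop would do the same job without being bounded by Python's recursion limit.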

'tuple' object has no attribute 'append'
