Question

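How would I start from a single web page, say the root of DMOZ.org, and index every URL attached to it, then store those links in a text file? I don't want the content, only the links themselves. An example would be great.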

Answer 1

For example, this prints the links on this very related (but poorly named) question:

from urllib.request import urlopen

from bs4 import BeautifulSoup

# Fetch the page and parse it with the built-in HTML parser.
q = urlopen('https://stackoverflow.com/questions/3884419/')
soup = BeautifulSoup(q.read(), 'html.parser')

# Walk every <a> tag: print its text and target, or its id if it has no href.
for link in soup.find_all('a'):
    if link.has_attr('href'):
        print(str(link.string) + " -> " + link['href'])
    elif link.has_attr('id'):
        print("ID: " + link['id'])
    else:
        print("???")

Output:

Stack Exchange -> http://stackexchange.com
log in -> /users/login?returnurl=%2fquestions%2f3884419%2f
careers -> http://careers.stackoverflow.com
meta -> http://meta.stackoverflow.com
...
ID: flag-post-3884419
None -> /posts/3884419/revisions
...
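
The question also asks for the links to be stored in a text file, which the snippet above does not do. Here is a minimal sketch of that step using only the standard library. The names `LinkCollector`, `extract_links`, and `save_links` are made up for illustration, and it assumes you already have the page's HTML as a string:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links such as /users/login against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all <a href=...> links, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def save_links(links, path):
    """Write one URL per line to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        for link in links:
            f.write(link + "\n")
```

To use it, you would fetch the page (for instance with `urllib.request.urlopen(url).read().decode('utf-8', 'replace')`), pass the result to `extract_links(html, url)`, and hand the returned list to `save_links(links, 'links.txt')`. Anchors without an `href` are simply skipped, and duplicates are kept; wrap the list in `set()` if you only want unique URLs.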

Answer 2

If you insist on reinventing the wheel, use an HTML parser such as BeautifulSoup to grab all the tags. This answer to a similar question is relevant.

Answer 3

Scrapy is a Python framework for web scraping. There are plenty of examples here: http://snippets.scrapy.org/popular/bookmarked/

How do I create a simple URL extractor in Python?
