Question

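I need some help with an assignment. I need to build a "simple" (according to my teacher) web scraper that takes a URL as an argument, searches the source code of that URL, and then returns the links within that source code (anything after href). The example URL my teacher has us using is http://citstudent.lanecc.net/tools.shtml. When the program is executed, it should return ten links as well as the URL of the website.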

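Since I am still trying to wrap my head around these concepts, I wasn't sure where to start, so I turned to Stack Overflow and found a script that works. It does what I want, but it does not fulfill every requirement: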

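import urllib2

# The URL is hard-coded here; the assignment wants it passed as an argument.
url = "http://citstudent.lanecc.net/tools.shtml"
page = urllib2.urlopen(url)

# Split the source on closing anchor tags so each chunk holds at most one link.
data = page.read().split("</a>")
tag = "<a href=\""
endtag = "\">"

for item in data:
    if "<a href" in item:
        try:
            # Drop everything up to and including the opening '<a href="'.
            ind = item.index(tag)
            item = item[ind + len(tag):]
            # The link ends at the first '">' that follows.
            end = item.index(endtag)
        except ValueError:
            # str.index raises ValueError when a marker is missing; skip the chunk.
            pass
        else:
            print item[:end]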

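This works because I hard-coded the URL into my code, and because it prints what comes after some of the href tags. Normally I would ask to just be guided through this rather than handed the code, but I'm having such a rough day that any explanation or example has to be better than what we went over in class. Thank you.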

How do I use urllib2, regex, and sys to read a specific URL and then return my search and results?
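
For reference, here is a rough sketch of how the three pieces named above might fit together: sys.argv supplies the URL, urlopen fetches the source, and a regular expression pulls out whatever follows href. The names extract_links, main, HREF_PATTERN, and the script name scraper.py are made up for illustration and are not part of the assignment. The import falls back to urllib.request so the same file also runs on Python 3, and the pattern is an assumption that only handles quoted href values, so it is not a full HTML parser.

```python
import re
import sys

try:
    from urllib2 import urlopen          # Python 2, as in the question
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalent

# Capture whatever sits between the quotes that follow href=
# (assumption: every href value is wrapped in single or double quotes).
HREF_PATTERN = re.compile(r'href\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)


def extract_links(html):
    """Return every quoted href value found in html, in document order."""
    return HREF_PATTERN.findall(html)


def main(url):
    """Fetch url, then print the URL followed by each link in its source."""
    html = urlopen(url).read()
    if not isinstance(html, str):
        # Python 3 returns bytes; decode before applying the text pattern.
        html = html.decode("utf-8", "replace")
    print("Links found in " + url + ":")
    for link in extract_links(html):
        print(link)


if __name__ == "__main__":
    # sys.argv[0] is the script name; the URL is expected as sys.argv[1].
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        sys.stderr.write("usage: python scraper.py <url>\n")
```

Assuming the file is saved as scraper.py, it would be run as `python scraper.py http://citstudent.lanecc.net/tools.shtml`. Keeping the regex in its own function means it can be tried on a plain string without touching the network.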

0 answers: