Python equivalent of wget for downloading a website and its resources

Posted: 2012-02-10 00:21:34

Tags: python web-crawler wget

The same thing was asked 2.5 years ago in Downloading a web page and all of its resource files in Python, but it never got an answer; the "please see the related topic" reply isn't really asking the same thing.

I want to download everything on a page so that it can be viewed purely from the downloaded files.

The command

wget --page-requisites --domains=DOMAIN --no-parent --html-extension --convert-links --restrict-file-names=windows

does exactly what I need. However, we want to be able to tie it into other things that have to be portable, so it needs to be in Python.

I have been looking at Beautiful Soup, Scrapy, and the various spiders posted around this site, but they all seem to deal with getting data/links in clever but specific ways. Using them to do what I want seems like it would take a lot of work just to handle finding all the resources, when I'm sure there must be a simple way.
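For concreteness, here is a rough sketch of what the resource-gathering step might look like with Beautiful Soup. This is only an illustration, not working crawler code; the URL, the choice of tags, and the BeautifulSoup 3 import are assumptions on my part.

import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

page_url = 'http://example.com/index.html'##hypothetical starting page
html = urllib2.urlopen(page_url).read()
soup = BeautifulSoup(html)

resources = set()
for tag in soup.findAll(['a', 'img', 'script', 'link']):##tags that reference other files
    ref = tag.get('href') or tag.get('src')
    if ref:
        ##resolve relative references against the page URL
        resources.add(urlparse.urljoin(page_url, ref))

for resource in resources:
    print resource##each of these would still need to be fetched and saved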

Many thanks

2 answers:

Answer 0 (score: 3)

You should use the right tool for the job at hand.

If you want to spider a website and save the pages to disk, Python probably isn't the best choice. Open source projects gain features when somebody needs those features, and because wget does the job so well, nobody has bothered to write a Python library to replace it.

Considering that wget runs on practically any platform that has a Python interpreter, is there any reason you can't use wget?
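If shelling out is an option, a minimal sketch of wrapping the wget command from the question with the standard subprocess module might look like the following (assuming the wget binary is on the PATH; the URL and domain are placeholders):

import subprocess

url = 'http://example.com/'##hypothetical target site
cmd = [
    'wget',
    '--page-requisites',
    '--domains=example.com',
    '--no-parent',
    '--html-extension',
    '--convert-links',
    '--restrict-file-names=windows',
    url,
]
subprocess.check_call(cmd)##raises CalledProcessError if wget exits non-zero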

Answer 1 (score: 2)

A colleague of mine wrote this code, pieced together, I believe, from various other sources. It may have some quirks specific to our system, but it should help anyone wanting to do the same thing.

"""
    Downloads all links from a specified location and saves to machine.
    Downloaded links will only be of a lower level than the link specified.
    To use: python downloader.py link
"""
import sys, re, os, urllib2, urllib, urlparse
tocrawl = set([sys.argv[1]])
# linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?')
linkregex = re.compile('href=[\'"](.*?)[\'"]')##find href="..." or href='...'
linksrc = re.compile('src=[\'"](.*?)[\'"]')##find src="..." or src='...'
def main():
    link_list = []##create a list of all found links so there are no duplicates
    restrict = sys.argv[1]##used to restrict found links to only have lower level
    link_list.append(restrict)
    parent_folder = restrict.rfind('/', 0, len(restrict)-1)
    ##a.com/b/c/d/ make /d/ as parent folder
    while 1:
        try:
            crawling = tocrawl.pop()
            #print crawling
        except KeyError:
            break
        url = urlparse.urlparse(crawling)##splits url into sections
        try:
            response = urllib2.urlopen(crawling)##try to open the url
        except:
            continue
        msg = response.read()##save source of url
        links = linkregex.findall(msg)##search for all href in source
        links = links + linksrc.findall(msg)##search for all src in source
        for link in links:##check each href/src reference found
            if link.startswith('/'):
                ##if /xxx a.com/b/c/ -> a.com/b/c/xxx
                link = 'http://' + url[1] + link
            elif '#' in link:
                ##skip in-page anchor links
                continue
            elif link.startswith('../'):
                if not link.startswith('../../'):##only use links that are at most 1 level above the reference
                    ##if ../xxx.html a.com/b/c/d.html -> a.com/b/xxx.html
                    parent_pos = url[2].rfind('/')
                    parent_pos = url[2].rfind('/', 0, parent_pos) + 1##find the '/' one directory level up
                    parent_url = url[2][:parent_pos]
                    new_link = link.find('/')+1
                    link = link[new_link:]
                    link = 'http://' + url[1] + parent_url + link
                else:
                    continue
            elif not link.startswith('http'):
                ##relative link: resolve against the directory of the current path
                ##if xxx.html a.com/b/c/d.html -> a.com/b/c/xxx.html
                ##if xxx.html a.com/b/c/ -> a.com/b/c/xxx.html
                a = url[2].rfind('/')+1
                parent = url[2][:a]
                link = 'http://' + url[1] + parent + link
            if link not in link_list:
                link_list.append(link)##add link to list of already found links
                if restrict in link:
                ##only grab links which are below input site
                    print link ##print downloaded link
                    tocrawl.add(link)##add link to pending view links
                    file_name = link[parent_folder+1:]##folder structure for files to be saved
                    filename = file_name.rfind('/')
                    folder = file_name[:filename]##creates folder names
                    folder = os.path.abspath(folder)##creates folder path
                    if not os.path.exists(folder):
                        os.makedirs(folder)##make folder if it does not exist
                    try:
                        urllib.urlretrieve(link, file_name)##download the link
                    except:
                        print "could not download %s"%link
                else:
                    continue
if __name__ == "__main__":
    main()
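For reference, a hypothetical invocation, following the docstring's usage note (the URL is just an example):

python downloader.py http://example.com/docs/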

Thanks for the replies