Download an entire directory with Python's SimpleHTTPServer

Time: 2010-04-04 05:30:24

Tags: python simplehttpserver

I really like how easily SimpleHTTPServer lets me share files over the network, but I wish it had an option like "download this entire directory". Is there an easy (one-liner) way to implement that?

Thanks

5 answers:

Answer 0 (score: 5)

Look at the sources, e.g. online here. Right now, if you call the server with a URL that names a directory, its index.html file is served, or, failing that, the list_directory method is called. Presumably, you want instead to build a zip file with the directory's contents (recursively, I imagine) and serve that? Obviously there's no way to do it with a one-line change, since you would be replacing what are now lines 68-80 (the send_head method) plus the whole list_directory method, lines 98-137 -- that's already a change of well over 50 lines ;-).

If you're OK with changing a few dozen lines rather than one, and the semantics I've described are what you want, you can of course build the required zip file as a cStringIO.StringIO object, populate it with the ZipFile class over the directory in question (assuming you want to recursively pick up all the subdirectories), and serve it. But it's definitely not going to be a one-liner ;-).
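A minimal sketch of the in-memory approach this answer describes, written for Python 3, where io.BytesIO stands in for the cStringIO.StringIO object mentioned above (the function name is illustrative, not from the answer):

```python
import io
import os
import zipfile

def zip_directory_to_buffer(directory):
    """Build a zip archive of `directory` (recursively) in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(directory):
            for name in files:
                full = os.path.join(root, name)
                # store entries relative to the archived directory
                zf.write(full, os.path.relpath(full, directory))
    buf.seek(0)  # rewind so the caller can stream the buffer out
    return buf
```

The returned buffer can then be handed to the response-writing code the same way send_head hands back an open file object.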

Answer 1 (score: 4)

There is no one-liner that would do it. And what do you mean by "download the whole directory" -- as a tar or a zip?

Either way, you can follow these steps:

  1. Derive a class from SimpleHTTPRequestHandler, or just copy its code
  2. Change the list_directory method to return a link to "download the whole folder"
  3. Change the copyfile method so that, for your link, it zips up the whole directory and returns it
  4. You can cache the zip so that you don't compress the folder on every request, but instead check whether any file has been modified
  5. It would be a fun exercise :)
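The first three steps can be sketched as a handler skeleton. This uses Python 3's http.server (the Python 3 home of SimpleHTTPRequestHandler); the class name and the `?download` marker are illustrative choices, and the actual zipping and caching are left out, as the exercise suggests:

```python
import http.server

class ZipDirHandler(http.server.SimpleHTTPRequestHandler):
    # Step 2: this is where a "download the whole folder" link would be
    # injected into the generated listing HTML; the sketch just defers
    # to the stock listing.
    def list_directory(self, path):
        return super().list_directory(path)

    # Step 3: intercept requests for the download link and serve a zip
    # of the directory instead of the usual listing.
    def do_GET(self):
        if self.path.endswith('?download'):
            # build (or, per step 4, reuse a cached copy of) the zip
            # archive here and stream it back to the client
            pass  # zipping logic omitted in this sketch
        return super().do_GET()
```

A full working version of this idea appears in the next answer.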

Answer 2 (score: 4)

I made that modification for you. I don't know whether there is a better way to do it, but:

Just save the file (e.g. ThreadedHTTPServer.py) and run:

$ python /path/to/ThreadedHTTPServer.py PORT

The modification also works in a threaded way, so you won't have problems downloading and navigating at the same time. The code isn't well organized, but:

from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
from SocketServer import ThreadingMixIn
import threading
import SimpleHTTPServer
import sys, os, zipfile

PORT = int(sys.argv[1])

def send_head(self):
    """Common code for GET and HEAD commands.

    This sends the response code and MIME headers.

    Return value is either a file object (which has to be copied
    to the outputfile by the caller unless the command was HEAD,
    and must be closed by the caller under all circumstances), or
    None, in which case the caller has nothing further to do.

    """
    path = self.translate_path(self.path)
    f = None

    if self.path.endswith('?download'):

        tmp_file = "tmp.zip"
        self.path = self.path.replace("?download","")

        zip = zipfile.ZipFile(tmp_file, 'w')
        for root, dirs, files in os.walk(path):
            for file in files:
                if os.path.join(root, file) != os.path.join(root, tmp_file):
                    zip.write(os.path.join(root, file))
        zip.close()
        path = self.translate_path(tmp_file)

    elif os.path.isdir(path):

        if not self.path.endswith('/'):
            # redirect browser - doing basically what apache does
            self.send_response(301)
            self.send_header("Location", self.path + "/")
            self.end_headers()
            return None
        else:

            for index in "index.html", "index.htm":
                index = os.path.join(path, index)
                if os.path.exists(index):
                    path = index
                    break
            else:
                return self.list_directory(path)
    ctype = self.guess_type(path)
    try:
        # Always read in binary mode. Opening files in text mode may cause
        # newline translations, making the actual size of the content
        # transmitted *less* than the content-length!
        f = open(path, 'rb')
    except IOError:
        self.send_error(404, "File not found")
        return None
    self.send_response(200)
    self.send_header("Content-type", ctype)
    fs = os.fstat(f.fileno())
    self.send_header("Content-Length", str(fs[6]))
    self.send_header("Last-Modified", self.date_time_string(fs.st_mtime))
    self.end_headers()
    return f

def list_directory(self, path):
    """Helper to produce a directory listing (absent index.html).

    Return value is either a file object, or None (indicating an
    error).  In either case, the headers are sent, making the
    interface the same as for send_head().

    """
    try:
        from cStringIO import StringIO
    except ImportError:
        from StringIO import StringIO
    import cgi, urllib

    try:
        list = os.listdir(path)
    except os.error:
        self.send_error(404, "No permission to list directory")
        return None
    list.sort(key=lambda a: a.lower())
    f = StringIO()
    displaypath = cgi.escape(urllib.unquote(self.path))
    f.write('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">')
    f.write("<html>\n<title>Directory listing for %s</title>\n" % displaypath)
    f.write("<body>\n<h2>Directory listing for %s</h2>\n" % displaypath)
    f.write("<a href='%s'>%s</a>\n" % (self.path+"?download",'Download Directory Tree as Zip'))
    f.write("<hr>\n<ul>\n")
    for name in list:
        fullname = os.path.join(path, name)
        displayname = linkname = name
        # Append / for directories or @ for symbolic links
        if os.path.isdir(fullname):
            displayname = name + "/"
            linkname = name + "/"
        if os.path.islink(fullname):
            displayname = name + "@"
            # Note: a link to a directory displays with @ and links with /
        f.write('<li><a href="%s">%s</a>\n'
                % (urllib.quote(linkname), cgi.escape(displayname)))
    f.write("</ul>\n<hr>\n</body>\n</html>\n")
    length = f.tell()
    f.seek(0)
    self.send_response(200)
    encoding = sys.getfilesystemencoding()
    self.send_header("Content-type", "text/html; charset=%s" % encoding)
    self.send_header("Content-Length", str(length))
    self.end_headers()
    return f

Handler = SimpleHTTPServer.SimpleHTTPRequestHandler
Handler.send_head = send_head
Handler.list_directory = list_directory

class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    """Handle requests in a separate thread."""

if __name__ == '__main__':
    server = ThreadedHTTPServer(('0.0.0.0', PORT), Handler)
    print 'Starting server, use <Ctrl-C> to stop'
    server.serve_forever()

Answer 3 (score: 1)

I like @mononoke's solution, but it has a few problems:

  • it writes files in text mode
  • sometimes the href and the link text differ, especially for non-ASCII paths
  • it does not download big files in chunks
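The second point is easy to trip over: for a non-ASCII filename, the href carries the percent-encoded form while the visible link text carries the raw name, which is why the fixed code below collects both a.get('href') and a.get_text(). A quick illustration with urllib.parse (the filename is just an example):

```python
from urllib.parse import quote, unquote

name = 'résumé.txt'
href = quote(name)            # percent-encoded form found in the href
assert href != name           # 'r%C3%A9sum%C3%A9.txt' vs the display text
assert unquote(href) == name  # decoding recovers the filesystem name
```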

I tried to fix these problems:

import os
from pathlib import Path
from urllib.parse import urlparse, urljoin
import requests
from bs4 import BeautifulSoup
import math

def get_links(content):
    soup = BeautifulSoup(content, 'html.parser')  # explicit parser avoids bs4's warning
    for a in soup.findAll('a'):
        yield a.get('href'), a.get_text()

def download(url, path=None, overwrite=False):
    if path is None:
        path = urlparse(url).path.lstrip('/')
    if url.endswith('/'):
        r = requests.get(url)
        if r.status_code != 200:
            raise Exception('status code is {} for {}'.format(r.status_code, url))
        content = r.text
        Path(path.rstrip('/')).mkdir(parents=True, exist_ok=True)
        for link, name in get_links(content):
            if not link.startswith('.'): # skip hidden files such as .DS_Store
                download(urljoin(url, link), os.path.join(path, name))
    else:
        if os.path.isfile(path):
            print("#existing", path)
            if not overwrite:
                return
        chunk_size = 1024*1024
        r = requests.get(url, stream=True)
        content_size = int(r.headers['content-length'])
        total = math.ceil(content_size / chunk_size)
        print("#", path)
        with open(path, 'wb') as f:
            c = 0
            st = 100
            for chunk in r.iter_content(chunk_size=chunk_size):
                c += 1
                if chunk:
                    f.write(chunk)
                ap = int(c*st/total) - int((c-1)*st/total)
                if ap > 0:
                    print("#" * ap, end="")
            print("\r  "," "*int(c*st/total), "\r", end="")
            
if __name__ == '__main__':
    # the trailing / indicates a folder
    url = 'http://ed470d37.ngrok.io/a/bc/'
    download(url, "/data/bc")

Answer 4 (score: 0)

There is no easy way.

An alternative is to use the Python script below to download the whole folder recursively. It works well with Python 3. Change the URL as needed.

import os
from pathlib import Path
from urllib.parse import urlparse, urljoin
import requests
from bs4 import BeautifulSoup

def get_links(content):
    soup = BeautifulSoup(content, 'html.parser')  # explicit parser avoids bs4's warning
    for a in soup.findAll('a'):
        yield a.get('href')

def download(url):
    path = urlparse(url).path.lstrip('/')
    print(path)
    r = requests.get(url)
    if r.status_code != 200:
        raise Exception('status code is {} for {}'.format(r.status_code, url))
    content = r.text
    if path.endswith('/'):
        Path(path.rstrip('/')).mkdir(parents=True, exist_ok=True)
        for link in get_links(content):
            if not link.startswith('.'): # skip hidden files such as .DS_Store
                download(urljoin(url, link))
    else:
        with open(path, 'w') as f:
            f.write(content)


if __name__ == '__main__':
    # the trailing / indicates a folder
    url = 'http://ed470d37.ngrok.io/a/bc/'
    download(url)