Question

我正在编写一个需要持久本地访问通过http获取的大文件的应用程序。我希望保存在本地目录（某种部分镜像）中的文件，以便后续执行应用程序只是注意到URL已经在本地镜像，以便其他程序可以使用它们。

理想情况下，它还可以保留时间戳或etag信息，并能够使用If-Modified-Since或If-None-Match标头快速发出http请求，以检查新版本，但除非文件有效，否则请避免完整下载已经更新。但是，由于这些网页很少发生变化，我可能会遇到过时副本中的错误，并且在适当的时候找到其他方法从缓存中删除文件。

环顾四周，我看到urllib.request.urlretrieve可以保存缓存副本，但看起来它无法处理我的If-Modified-Since缓存更新目标。

请求模块似乎是最新的和最好的，但它似乎并不适合这种情况。有一个CacheControl附加模块，它支持我的缓存更新目标，因为它执行完整的HTTP缓存。但它似乎并没有以可直接用于其他（非python）程序的方式存储获取的文件，因为FileCache将资源存储为pickle数据。而can python-requests fetch url directly to file handle on disk like curl? - Stack Overflow的讨论表明，保存到本地文件可以使用一些额外的代码完成，但这似乎与CacheControl模块没有很好的混合。

那么有一个网络抓取库可以满足我的需求吗？这基本上可以保留过去提取的文件的镜像（并告诉我文件名是什么），而不必明确地管理所有文件？

Answer 1

我有相同的要求并找到requests-cache。添加它非常容易，因为它扩展了requests。您可以将缓存放在内存中，并在脚本结束后消失，或者使用sqlite，mongodb或redis使其保持持久性。以下是我写的两行，并按宣传方式工作：

import requests, requests_cache
requests_cache.install_cache('scraper_cache', backend='sqlite', expire_after=3600)

Answer 2

我认为没有一个库可以做到这一点，但实现起来并不难。这里有一些我正在使用请求的函数，可能会帮助你：

import os
import os.path as op
import requests
import urlparse

CACHE = 'path/to/cache'

def _file_from_url(url):
    return op.basename(urlparse.urlsplit(url).path)


def is_cached(*args, **kwargs):
    url = kwargs.get('url', args[0] if args else None)
    path = op.join(CACHE, _file_from_url(url))

    if not op.isfile(path):
        return False

    res = requests.head(*args, **kwargs)

    if not res.ok:
        return False

    # Check if cache is stale. For me, checking content-length fitted my use case.
    # You can use modification date or etag here:

    if not 'content-length' in res.headers:
        return False

    return op.getsize(path) == int(res.headers['content-length'])


def update_cache(*args, **kwargs):
    url = kwargs.get('url', args[0] if args else None)
    path = op.join(CACHE, _file_from_url(url))

    res = requests.get(*args, **kwargs)

    if res.ok:
        with open(path, 'wb') as handle:
            for block in res.iter_content(1024):
                if block:
                    handle.write(block)
                    handle.flush()

用法：

if not is_cached('http://www.google.com/humans.txt'):
    update_cache('http://www.google.com/humans.txt')

# Do something with cache/humans.txt

在Python中维护网页的更新文件缓存？

2 个答案: