Question

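How can I use the Python standard library to get a file object, silently ensuring it is kept up to date from some other location?

A program I'm working on needs to access a set of files locally; they're just normal files.

But those files are local cached copies of documents available remotely at URLs; each file has a canonical URL for that file's content.

(I write here about HTTP URLs, but I'm looking for a solution that isn't specific to any particular remote fetching protocol.)

I'd like an API for 'get_file_from_cache' that looks something like: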

file_urls = {
        "/path/to/foo.txt": "http://example.org/spam/",
        "other/path/bar.data": "https://example.net/beans/flonk.xml",
        }

for (filename, url) in file_urls.items():
    infile = get_file_from_cache(filename, canonical=url)
    do_stuff_with(infile.read())

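If the local file's modification timestamp is not significantly earlier than the Last-Modified timestamp of the document at the corresponding URL, get_file_from_cache just returns the file object without changing the file.
The local file might be out of date (its modification timestamp may be significantly older than the Last-Modified timestamp of the corresponding URL). In that case, get_file_from_cache should first read the document's contents into the file, then return the file object.
The local file may not yet exist. In that case, get_file_from_cache should first read the document content from the corresponding URL, create the local file, and then return the file object.
The remote URL may not be available for some reason. In that case, get_file_from_cache should simply return the file object for the existing local file, or if that can't be done, raise an error.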
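
The four cases above can be sketched as a single function. This is a minimal, protocol-agnostic sketch and not an established recipe: `fetch` and `remote_mtime` are hypothetical caller-supplied callables (they are not part of the question), and `tolerance` is an assumed name for how much older the local file may be before it counts as "significantly earlier".

```python
import os


def get_file_from_cache(filename, canonical, fetch, remote_mtime,
                        tolerance=0, mode="rb"):
    """Open `filename`, refreshing it from `canonical` first if it is stale.

    `fetch(url)` must return the document as bytes and `remote_mtime(url)`
    its last-modified time as a Unix timestamp. Both are hypothetical
    callables supplied by the caller and should raise OSError on failure.
    """
    cached = os.path.exists(filename)
    data = None
    try:
        remote_time = remote_mtime(canonical)
        # Case 1: the local copy is recent enough, so leave it untouched.
        fresh = cached and remote_time <= os.path.getmtime(filename) + tolerance
        if not fresh:
            # Cases 2 and 3: the local copy is stale or missing.
            data = fetch(canonical)
    except OSError:
        # Case 4: the remote is unavailable; use the cached copy if any.
        if not cached:
            raise
        data = None

    if data is not None:
        directory = os.path.dirname(filename)
        if directory:
            os.makedirs(directory, exist_ok=True)
        with open(filename, "wb") as f:
            f.write(data)
        # Record the remote timestamp so later comparisons are meaningful.
        os.utime(filename, (remote_time, remote_time))
    return open(filename, mode)
```

Because the network access lives entirely in the two callables, the same function could sit in front of HTTP, FTP, or a plain file copy. With `urllib`, network failures surface as `urllib.error.URLError`, which is a subclass of `OSError`, so they would take the fallback path here.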

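So this is something similar to an HTTP object cache. But where those are usually URL-focused, with the local files a hidden implementation detail, I want an API that focuses on the local files, with the remote requests a hidden implementation detail.

Does anything like this exist in the Python library, or as simple code using it? With or without the specifics of HTTP and URLs, is there some generic caching recipe already implemented with the standard library?

This local file cache (ignoring the specifics of URLs and network access) seems like exactly the kind of thing that is easy to get wrong in countless ways, and so should have a single obvious implementation available.

Am I in luck? What do you advise?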

Answer 1

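From a quick Google search I couldn't find an existing library that does this, although I'd be surprised if there weren't such a thing. :)

Anyway, here's one way to do it using the popular Requests module. It would be pretty easy to adapt this code to use urllib / urllib2 instead, though.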

#! /usr/bin/env python

''' Download a file if it doesn't yet exist in offline cache, or if the online
    version is more than age seconds newer than the cached version.

    Example code for
http://stackoverflow.com/questions/26436641/access-a-local-file-but-ensure-it-is-up-to-date

    Written by PM 2Ring 2014.10.18
'''

import sys
import os
import email.utils
import requests


cache_path = 'offline_cache'

#Translate local file names in cache_path to URLs
file_urls = {
    'example1.html': 'http://www.example.com/',
    'badfile': 'http://httpbin.org/status/404',
    'example2.html': 'http://www.example.org/index.html',
}


def get_headers(url):
    ''' Print the status code and response headers of url (debugging aid) '''
    resp = requests.head(url)
    print("Status: %d" % resp.status_code)
    resp.raise_for_status()
    for k, v in resp.headers.items():
        print('%-16s : %s' % (k, v))


def get_url_mtime(url):
    ''' Get last modified time of an online file from the headers
    and convert to a timestamp
    '''
    resp = requests.head(url)
    resp.raise_for_status()
    t = email.utils.parsedate_tz(resp.headers['last-modified'])
    return email.utils.mktime_tz(t)


def download(url, fname):
    ''' Download url to fname, setting mtime of file to match url '''
    sys.stderr.write("Downloading '%s' to '%s'\n" % (url, fname))
    resp = requests.get(url)
    #print "Status: %d" % resp.status_code
    resp.raise_for_status()

    t = email.utils.parsedate_tz(resp.headers['last-modified'])
    timestamp = email.utils.mktime_tz(t)
    #print 'last-modified', timestamp

    #Create the cache directory on first use
    dirname = os.path.dirname(fname)
    if dirname and not os.path.isdir(dirname):
        os.makedirs(dirname)

    with open(fname, 'wb') as f:
        f.write(resp.content)
    os.utime(fname, (timestamp, timestamp))


def open_cached(basename, mode='r', age=0):
    ''' Open a cached file.

    Download it if it doesn't yet exist in cache, or if the online
    version is more than age seconds newer than the cached version.'''

    fname = os.path.join(cache_path, basename)
    url = file_urls[basename]
    #print fname, url

    if os.path.exists(fname):
        #Check if online version is sufficiently newer than offline version
        file_mtime = os.path.getmtime(fname)
        url_mtime = get_url_mtime(url)
        if url_mtime > age + file_mtime:
            download(url, fname)
    else:
        download(url, fname)

    return open(fname, mode)


def main():
    for fname in ('example1.html', 'badfile', 'example2.html'):
        print(fname)
        try:
            with open_cached(fname, 'r') as f:
                for i, line in enumerate(f, 1):
                    print("%3d: %s" % (i, line.rstrip()))
        except requests.exceptions.HTTPError as e:
            sys.stderr.write("%s '%s' = '%s'\n" % (e, file_urls[fname], fname))
        print("")


if __name__ == "__main__":
    main()

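Of course, for real-world use you should add some proper error checking.

You may notice that I've defined a function get_headers(url) which never gets called; I used it during development and figured it might come in handy when expanding this program, so I left it in. :)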
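
As one possible illustration of that error checking, the sketch below handles two weak spots of the download step: a response with no usable Last-Modified header, and a cache file left half-written if the write is interrupted. The names `header_timestamp` and `save_download` are hypothetical helpers, not part of the code above, and `headers` is assumed to be any mapping with a `.get` method keyed by lower-case header names (the headers object of a Requests response accepts any case).

```python
import email.utils
import os


def header_timestamp(headers, name="last-modified"):
    """Return the named HTTP date header as a Unix timestamp, or None."""
    value = headers.get(name)
    if value is None:
        return None
    parsed = email.utils.parsedate_tz(value)
    if parsed is None:
        # The header is present but is not a parseable date.
        return None
    return email.utils.mktime_tz(parsed)


def save_download(fname, content, headers):
    """Write content (bytes) to fname, copying the remote mtime if known."""
    # Write to a temporary name first so a failed write never leaves a
    # truncated file under the real cache name.
    tmp_name = fname + ".part"
    with open(tmp_name, "wb") as f:
        f.write(content)
    timestamp = header_timestamp(headers)
    if timestamp is not None:
        os.utime(tmp_name, (timestamp, timestamp))
    # os.replace (Python 3.3+) overwrites an existing destination.
    os.replace(tmp_name, fname)
```

When the header is missing, the file simply keeps the time of the download as its modification time, which is a design choice of this sketch, not something the answer specifies.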
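
Regarding the earlier remark that this code could be adapted to urllib: a standard-library-only version of get_url_mtime might look roughly like the sketch below, written for Python 3, where urllib2 became `urllib.request`. The helper name `http_date_to_timestamp` and the `timeout` parameter are assumptions added for illustration, and the sketch does not handle a response that lacks a Last-Modified header.

```python
import email.utils
import urllib.request


def http_date_to_timestamp(value):
    """Convert an HTTP date string to a Unix timestamp."""
    return email.utils.mktime_tz(email.utils.parsedate_tz(value))


def get_url_mtime(url, timeout=10):
    """Return the Last-Modified time of url, using a HEAD request."""
    request = urllib.request.Request(url, method="HEAD")
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses,
    # which plays the role of resp.raise_for_status() in Requests.
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return http_date_to_timestamp(resp.headers["Last-Modified"])
```

A download function could be adapted the same way, reading the body with `resp.read()` from a plain `urlopen(url)` call.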

Access a local file, but ensure it is up to date
