Question

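I'm writing a Python command-line utility that converts a string into a TextBlob, which is part of a natural language processing module. Importing the module is very slow, about 300 ms on my system. To speed things up, I created a memoized function that converts the text to a TextBlob only the first time the function is called. Importantly, if I run my script twice on the same text, I want to avoid re-importing TextBlob and recomputing the blob, and instead pull it from the cache. That is all done and works fine, except that for some reason the function is still very slow. In fact, it is as slow as it was before. I think this must be because the module is being re-imported even though the function is memoized and the import statement happens inside the memoized function.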

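The goal here is to fix the following code so that memoized runs are as fast as they ought to be, given that the result does not need to be recomputed.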

Here's a minimal example of the core code:

@memoize
def make_blob(text):
    import textblob
    return textblob.TextBlob(text)


if __name__ == '__main__':
    make_blob("hello")

Here's the memoization decorator:

import os
import shelve
import functools
import inspect


def memoize(f):
    """Cache results of computations on disk in a directory called 'cache'."""
    path_of_this_file = os.path.dirname(os.path.realpath(__file__))
    cache_dirname = os.path.join(path_of_this_file, "cache")

    if not os.path.isdir(cache_dirname):
        os.mkdir(cache_dirname)

    cache_filename = f.__module__ + "." + f.__name__
    cachepath = os.path.join(cache_dirname, cache_filename)

    try:
        cache = shelve.open(cachepath, protocol=2)
    except Exception:
        print('Could not open cache file %s, maybe name collision' % cachepath)
        cache = None

    @functools.wraps(f)
    def wrapped(*args, **kwargs):
        argdict = {}

        # handle instance methods
        if hasattr(f, '__self__'):
            args = args[1:]

        tempargdict = inspect.getcallargs(f, *args, **kwargs)

        for k, v in tempargdict.items():
            argdict[k] = v

        key = str(hash(frozenset(argdict.items())))

        try:
            return cache[key]
        except KeyError:
            value = f(*args, **kwargs)
            cache[key] = value
            cache.sync()
            return value
        except TypeError:
            call_to = f.__module__ + '.' + f.__name__
            print('Warning: could not disk cache call to '
                  '%s; it probably has unhashable args' % call_to)
            return f(*args, **kwargs)

    return wrapped

Here's a demonstration that the memoization currently doesn't save any time:

❯ time python test.py
python test.py  0.33s user 0.11s system 100% cpu 0.437 total

~/Desktop
❯ time python test.py
python test.py  0.33s user 0.11s system 100% cpu 0.436 total

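This happens even though the function is memoized correctly (a print statement placed inside the memoized function produces output only the first time the script is run).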

I've put everything together into a GitHub Gist in case it's helpful.
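
One way to test the hypothesis above is to record which modules get loaded while the memoized function runs. The helper below is a minimal sketch; `modules_imported_by` is a hypothetical name, not part of the original code. It compares `sys.modules` before and after a call.

```python
import sys


def modules_imported_by(func, *args, **kwargs):
    """Call func and report which modules were newly imported during the call.

    Hypothetical diagnostic helper, not part of the original script.
    """
    before = set(sys.modules)
    result = func(*args, **kwargs)
    newly_imported = sorted(set(sys.modules) - before)
    return result, newly_imported


# Assumed usage with the memoized function from the question:
# blob, new_modules = modules_imported_by(make_blob, "hello")
# print('textblob' in new_modules)
```

If `textblob` shows up in the list on a run that is served from the cache, the import is still happening somewhere other than the body of `make_blob`.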

Answer 1

How about a different approach:

import pickle

CACHE_FILE = 'cache.pkl'

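try:
    # Pickle files must be opened in binary mode.
    with open(CACHE_FILE, 'rb') as pkl:
        obj = pickle.load(pkl)
except (IOError, EOFError, pickle.UnpicklingError):
    # Cache miss: pay for the slow import only now.
    import slowmodule
    obj = "something"
    with open(CACHE_FILE, 'wb') as pkl:
        pickle.dump(obj, pkl)

print(obj)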

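Here we cache the object, not the module. Note that this will not save you anything if the object you are caching requires slowmodule. In the example above you do see a saving, because "something" is a string and slowmodule is not needed to understand it. However, if you did something like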

obj = slowmodule.Foo("bar")

the unpickling process would automatically import slowmodule, negating any benefit of the caching.
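
The automatic import can be observed with the standard library alone. The sketch below uses `fractions.Fraction` as a stand-in for `slowmodule.Foo`, and `unpickle_imports_module` is a hypothetical helper name. It removes the module from `sys.modules` to imitate a fresh process, then checks whether unpickling brings it back.

```python
import pickle
import sys
from fractions import Fraction


def unpickle_imports_module(obj, module_name):
    """Return True if unpickling obj causes module_name to be imported."""
    payload = pickle.dumps(obj)
    # Imitate a fresh process in which the module has not been loaded yet.
    saved = sys.modules.pop(module_name, None)
    try:
        pickle.loads(payload)
        return module_name in sys.modules
    finally:
        # Restore the original module object so the caller is unaffected.
        if saved is not None:
            sys.modules[module_name] = saved


print(unpickle_imports_module(Fraction(1, 3), "fractions"))  # True
print(unpickle_imports_module("something", "fractions"))     # False
```

The pickled `Fraction` stores a reference to its class by module and name, so loading it has to import `fractions`; the plain string carries no such reference.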

So if you can reduce textblob.TextBlob(text) to something that doesn't need the textblob module when it is unpickled, you will see savings with this approach.
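
As a rough sketch of that idea, the cache can hold plain data derived from the blob rather than the blob itself. Everything named here is an assumption for illustration: `cached_plain`, `blob_words`, and the JSON cache file are hypothetical, and the example assumes the word list is all the script needs from the TextBlob.

```python
import hashlib
import json


def cached_plain(text, compute, cache_path):
    """Return compute(text), caching JSON-serializable results on disk."""
    # A hashlib digest is stable across runs, unlike the built-in hash(),
    # which is randomized per process for strings in Python 3.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    try:
        with open(cache_path) as fh:
            cache = json.load(fh)
    except (IOError, ValueError):
        cache = {}  # missing or unreadable cache file: start fresh
    if key not in cache:
        cache[key] = compute(text)
        with open(cache_path, "w") as fh:
            json.dump(cache, fh)
    return cache[key]


def blob_words(text):
    """Reduce a TextBlob to a list of plain strings (runs only on a miss)."""
    from textblob import TextBlob  # slow import is deferred to cache misses
    return [str(word) for word in TextBlob(text).words]


# Assumed usage:
# words = cached_plain("hello world", blob_words, "cache.json")
```

Because the cache file contains only strings and lists, reading it back never touches textblob, so a cached run skips the slow import entirely.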
