Question

I need to normalize all the words in a huge corpus. Any ideas on how to optimize this code? It's too slow...

texts = [ [ list(morph.normalize(word.upper()))[0] for word in document.split() ]
            for document in documents ]

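documents is a list of strings, where each string is the text of a single book.

morph.normalize only works on upper-case input, so I apply .upper() to every word. It also returns a set with a single element, which is the normalized word (a string).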

Answer 1

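The first obvious thing I'd do is cache the normalized words in a local dict, to avoid calling morph.normalize() more than once for a given word.

The second optimization is to alias the method to a local variable. This avoids going through the whole attribute lookup + function descriptor invocation + method object instantiation on every turn of the loop.

Then, since it's a "huge" corpus, you probably want to avoid building the full list of lists at once, which might eat all your RAM, make your computer start to swap (which is guaranteed to make it snail slow) and finally crash with a memory error. I don't know what you're supposed to do with this list of lists, nor how big each document is, but as an example I iterate over the per-document results and write them to stdout. What should really be done depends on the context and the concrete use case.

NB: untested code, obviously, but at least this should get you started.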

def iterdocs(documents, morph):
    # keep track of already normalized words
    # beware this dict might get too big if you
    # have a lot of different words. Depending on
    # your corpus, you may want to either use a LRU
    # cache instead and/or use a per-document cache
    # and/or any other appropriate caching strategy...
    cache = {}

    # aliasing methods as local variables
    # is faster for tight loops
    normalize = morph.normalize

    def norm(word):
        upw = word.upper()
        if upw in cache:
            return cache[upw]
        # normalize() returns a one-element set; pop() extracts that element
        nw = cache[upw] = normalize(upw).pop()
        return nw

    for document in documents:
        # str.split() with no argument never yields empty strings
        words = [norm(word) for word in document.split()]
        yield words

for text in iterdocs(documents, morph):
    # if you need all the texts for further use,
    # at least write them to disk or some other means
    # of persistence and re-read them when needed.
    # Here I just write them to sys.stdout as an example
    print(text)
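
As a rough sketch of the LRU option mentioned in the code comments, the hand-rolled dict could be swapped for functools.lru_cache, which bounds memory by evicting the least recently used words. Everything here except lru_cache itself is an assumption made for illustration: FakeMorph is a made-up stand-in that only mimics the "returns a one-element set" behaviour described in the question, and make_norm and the maxsize value are hypothetical.

```python
from functools import lru_cache


class FakeMorph:
    """Hypothetical stand-in for the real analyzer (NOT a real library API)."""

    def __init__(self):
        self.calls = 0  # counts how often normalize() is really invoked

    def normalize(self, word):
        self.calls += 1
        # toy rule for the demo: drop a trailing "S"
        stem = word[:-1] if len(word) > 1 and word.endswith("S") else word
        return {stem}


def make_norm(morph, maxsize=100_000):
    """Build a norm(word) function backed by a bounded LRU cache."""
    normalize = morph.normalize  # local alias, as in the answer

    @lru_cache(maxsize=maxsize)
    def cached(upper_word):
        # next(iter(...)) reads the single element without mutating the set
        return next(iter(normalize(upper_word)))

    def norm(word):
        # cache on the upper-cased form so "cats" and "Cats" share one entry
        return cached(word.upper())

    norm.cache_info = cached.cache_info  # expose hit/miss statistics
    return norm
```

Calling norm.cache_info() after a run shows the hit and miss counts, which should help when picking a maxsize for a given corpus.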

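Also, I don't know where you get your documents from, but if they are text files, you may want to avoid loading them all into memory. Just read them one by one, and if they are themselves huge, don't even read a whole file at once (you can iterate over a file line by line, which is the most obvious choice for text).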
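
A minimal sketch of the line-by-line approach, assuming the documents are plain text files on disk. The function name iter_file_words, the path and encoding parameters, and the idea of passing in a norm callable (any function mapping a word to its normalized form) are assumptions for illustration only.

```python
def iter_file_words(path, norm, encoding="utf-8"):
    """Yield one list of normalized words per non-empty line of a text file.

    Only one line is held in memory at a time, so the file size
    does not matter.
    """
    with open(path, encoding=encoding) as f:
        for line in f:  # file objects iterate lazily, line by line
            words = [norm(word) for word in line.split()]
            if words:  # skip blank lines
                yield words
```

Something like `for words in iter_file_words("book.txt", norm): ...` would then process each file without ever holding more than one line of it.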

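Finally, once you've made sure your code doesn't eat up too much memory for a single document, the next obvious optimization is parallelization: run one process per available core and split the corpus between the processes (each one writing its results to a given place). Then you just have to combine the results if you need them all at once...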

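Oh, and yes: if that's still not enough, you may want to distribute the work with some map-reduce framework. Your problem looks like a perfect fit for map-reduce.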

Normalize all words in a document
