Efficiently extracting 1-5 ngrams with Python

Posted: 2014-10-13 13:45:03

Tags: python nlp nltk information-retrieval n-gram

I have a huge file of 3,000,000 lines, and each line has 20-40 words. I have to extract 1 to 5 ngrams from the corpus. My input files are tokenized plain text, e.g.:

This is a foo bar sentence .
There is a comma , in this sentence .
Such is an example text .

Currently, I am doing it as below, but this doesn't seem to be an efficient way to extract the 1-5 grams:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import io, os
from collections import Counter
import sys; reload(sys); sys.setdefaultencoding('utf-8')

with io.open('train-1.tok.en', 'r', encoding='utf8') as srcfin, \
io.open('train-1.tok.jp', 'r', encoding='utf8') as trgfin:
    # Extract words from file. 
    src_words = ['<s>'] + srcfin.read().replace('\n', ' </s> <s> ').split()
    del src_words[-1] # Removes the final '<s>'
    trg_words = ['<s>'] + trgfin.read().replace('\n', ' </s> <s> ').split()
    del trg_words[-1] # Removes the final '<s>'

    # Unigrams count.
    src_unigrams = Counter(src_words) 
    trg_unigrams = Counter(trg_words) 
    # Sum of unigram counts.
    src_sum_unigrams = sum(src_unigrams.values())
    trg_sum_unigrams = sum(trg_unigrams.values())

    # Bigrams count.
    src_bigrams = Counter(zip(src_words,src_words[1:]))
    trg_bigrams = Counter(zip(trg_words,trg_words[1:]))
    # Sum of bigram counts.
    src_sum_bigrams = sum(src_bigrams.values())
    trg_sum_bigrams = sum(trg_bigrams.values())

    # Trigrams count.
    src_trigrams = Counter(zip(src_words,src_words[1:], src_words[2:]))
    trg_trigrams = Counter(zip(trg_words,trg_words[1:], trg_words[2:]))
    # Sum of trigram counts.
    src_sum_trigrams = sum(src_trigrams.values())
    trg_sum_trigrams = sum(trg_trigrams.values())

Is there any other way to do this more efficiently?

And how can I optimally extract the different N-grams simultaneously?

From Fast/Optimize N-gram implementations in python, it essentially boils down to this:

zip(*[words[i:] for i in range(n)])

which, when hard-coded for bigrams with n=2, is:

zip(src_words,src_words[1:])

and this for trigrams with n=3:

zip(src_words,src_words[1:],src_words[2:])
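
A generalized sketch of that pattern (illustrative only; the helper name ngram_counts is made up) would build one counter per order in a single call:

from collections import Counter

def ngram_counts(words, max_n=5):
    # One Counter per order 1..max_n, reusing the zip() trick above.
    return {n: Counter(zip(*[words[i:] for i in range(n)]))
            for n in range(1, max_n + 1)}

src_ngrams = ngram_counts(src_words)  # src_ngrams[1] holds unigrams, src_ngrams[5] holds 5-grams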

3 Answers:

Answer 0 (score: 7):

If you are interested only in the most common (frequent) n-grams (which I suppose is your case), you can reuse the central idea of the Apriori algorithm. Given s_min, a minimal support which can be thought of as the number of lines that a given n-gram is contained in, it efficiently searches for all such n-grams.

The idea is as follows: write a query function which takes an n-gram and tests how many times it is contained in the corpus. After you have such a function prepared (it may be optimized as discussed later), scan the whole corpus and get all the 1-grams, i.e. bare tokens, and select those which are contained at least s_min times. This gives you the subset F1 of frequent 1-grams. Then test all the possible 2-grams by combining all the 1-grams from F1. Again, select those which hold the s_min criterion and you will get F2. By combining all the 2-grams from F2 and selecting the frequent 3-grams, you will get F3. Repeat for as long as Fn is non-empty.

Many optimizations can be done here. When combining the n-grams from Fn, you can exploit the fact that n-grams x and y may only be combined into an (n+1)-gram iff x[1:] == y[:-1] (which can be checked in constant time for any n if proper hashing is used). Moreover, if you have enough RAM (for your corpus, many GBs), you can speed up the query function enormously. For each 1-gram, store a hash-set of the indices of the lines containing that 1-gram. When combining two n-grams into an (n+1)-gram, use the intersection of the two corresponding sets to obtain the set of lines where the (n+1)-gram may be contained.

The time complexity grows as s_min decreases. The beauty is that infrequent (and hence uninteresting) n-grams are completely filtered out as the algorithm runs, saving computational time for the frequent ones only.
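
A rough sketch of the idea (my own illustration of the answer, not the author's code; the function name, the s_min default, and keeping the tokenized corpus in RAM for candidate verification are all assumptions):

import io
from collections import defaultdict

def frequent_ngrams(path, s_min=5, max_n=5):
    # Map each 1-gram to the set of line indices containing it, and keep the
    # tokenized corpus in memory so candidate (n+1)-grams can be verified cheaply.
    lines_of = defaultdict(set)
    corpus = []
    with io.open(path, encoding='utf8') as f:
        for idx, line in enumerate(f):
            tokens = line.split()
            corpus.append(tokens)
            for tok in tokens:
                lines_of[(tok,)].add(idx)

    # F1: 1-grams contained in at least s_min lines.
    frequent = {1: {g: s for g, s in lines_of.items() if len(s) >= s_min}}

    n = 1
    while frequent[n] and n < max_n:
        candidates = {}
        grams = frequent[n]
        for x, x_lines in grams.items():
            for y, y_lines in grams.items():
                if x[1:] != y[:-1]:           # x and y combine iff suffix == prefix
                    continue
                cand = x + y[-1:]
                maybe = x_lines & y_lines     # only these lines can contain cand
                support = {i for i in maybe
                           if any(tuple(corpus[i][j:j + n + 1]) == cand
                                  for j in range(len(corpus[i]) - n))}
                if len(support) >= s_min:
                    candidates[cand] = support
        n += 1
        frequent[n] = candidates              # F(n+1); the loop stops once it is empty
    return frequent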

Answer 1 (score: 2):

I'm giving you a bunch of pointers regarding the general problems you are trying to solve. One or more of these should be useful for you and help you figure this out.

For what you are doing (I'm guessing some sort of machine translation experiment), you don't really need to load the two files srcfin and trgfin into memory at the same time (at least not for the code sample you have provided). Processing them separately will be cheaper in terms of the amount of stuff you need to hold in memory at any given time.

You are reading a lot of data into memory, processing it (which takes even more memory), and then holding the results in some in-memory data structures. Instead of doing that, you should strive to be lazier. Learn about Python generators and write a generator which streams out all the ngrams from a given text without needing to hold the entire text in memory at any point in time. The itertools package will probably come in handy while writing it.
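
A minimal sketch of such a generator (the file and function names are just for illustration), using itertools so that only one line is in memory at a time:

import io
from collections import Counter
from itertools import tee, islice

def stream_ngrams(lines, n):
    # Lazily yield n-grams line by line; the whole file is never held in memory.
    for line in lines:
        words = ['<s>'] + line.split() + ['</s>']
        shifted = (islice(it, i, None) for i, it in enumerate(tee(words, n)))
        for gram in zip(*shifted):
            yield gram

with io.open('train-1.tok.en', encoding='utf8') as f:
    trigram_counts = Counter(stream_ngrams(f, 3))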

Beyond a point, it will no longer be feasible for you to hold all this data in memory. You should consider looking at map-reduce to help you break this down. Check out the mrjob package, which lets you write map-reduce jobs in Python. In the mapper step you would break the text up into its ngrams, and in the reducer stage you would count the number of times you see each ngram to get its overall count. mrjob can also be run locally, which obviously won't give you any parallelization benefits, but is nice because mrjob will still do a lot of the heavy lifting for you.
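
A sketch of what such a job could look like (the class name and the fixed N are purely illustrative; the mapper/reducer layout follows mrjob's usual word-count pattern):

from mrjob.job import MRJob

class MRNgramCount(MRJob):
    N = 3  # illustrative: count trigrams; run one job per order, or emit several orders per line

    def mapper(self, _, line):
        words = line.split()
        for i in range(len(words) - self.N + 1):
            yield ' '.join(words[i:i + self.N]), 1

    def reducer(self, ngram, counts):
        yield ngram, sum(counts)

if __name__ == '__main__':
    MRNgramCount.run()

Run locally with something like python mr_ngram_count.py train-1.tok.en > counts.txt (the script name is made up); pointing the same job at a Hadoop or EMR cluster is what eventually buys you the parallelism.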

If you are forced to hold all the counts in memory at the same time (for a massive amount of text), then either implement some pruning strategy to drop the very rare ngrams, or consider using a file-based persistent lookup table such as SQLite to hold all the data for you.
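
For the SQLite route, a minimal sketch (the table layout and helper name are mine, and the ON CONFLICT upsert syntax needs a reasonably recent SQLite) could flush a per-chunk Counter to disk every so often:

import sqlite3
from collections import Counter

conn = sqlite3.connect('ngrams.db')
conn.execute('CREATE TABLE IF NOT EXISTS counts (ngram TEXT PRIMARY KEY, freq INTEGER)')

def flush(chunk_counts):
    # chunk_counts: Counter mapping an n-gram string to its count within the current chunk.
    conn.executemany(
        'INSERT INTO counts (ngram, freq) VALUES (?, ?) '
        'ON CONFLICT(ngram) DO UPDATE SET freq = freq + excluded.freq',
        chunk_counts.items())
    conn.commit()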

Answer 2 (score: 2):

Assuming you don't want to count ngrams across lines, and assuming naive tokenization:

import collections

def ngrams(n, f):
    window = collections.deque(maxlen=n)  # sliding window over the last n tokens
    for line in f:
        window.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        window.extend(words[:n-1])  # pre-seed so the counter doesn't count incomplete n-grams
        for word in words[n-1:]:
            window.append(word)
            yield tuple(window)  # emit the current n-gram

counters = [collections.Counter(ngrams(n, open('somefile.txt'))) for n in range(1, 6)]  # 1-grams .. 5-grams

Edit: added beginning/end-of-line tokens.

I assume you want the resulting data object to be as sparse as possible. 3M lines at ~40 words each is ~120M tokens. With about 1M words in English (though most are not commonly used), you'll probably get a rather long tail. If you can imagine your data to be exchangeable / i.i.d., then you can add some pruning in the middle:

def ngrams(n, f, prune_after=10000):
    counter = collections.Counter()
    window = collections.deque(maxlen=n)
    for i, line in enumerate(f):
        window.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        window.extend(words[:n-1])
        for word in words[n-1:]:
            window.append(word)
            ngram = tuple(window)
            # After prune_after lines, only n-grams already seen keep accumulating counts.
            if i < prune_after or ngram in counter:
                counter[ngram] += 1
    return counter

Relaxing the exchangeability assumption would require something like Tregoreg's answer for efficient pruning, but in most cases exchangeability should hold.

As far as raw speed goes, I think zip (as in the original code) vs deque is the fundamental question. zip removes the innermost loop, so it is likely already very fast. deque requires the innermost loop, but it also consumes the data iteratively, so its working memory footprint should be much smaller. Which is better will likely depend on your machine, but I'd imagine that for large machines / small data, zip would be faster. Once you start running out of memory (especially if you start talking about pruning), however, deque gains a few more advantages.