在python中快速/优化N-gram实现

时间:2014-02-19 14:16:02

标签: python nlp nltk information-retrieval n-gram

python中哪个ngram实现最快?

我试图描述nltk与vs scott的zip(http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/):

from nltk.util import ngrams as nltkngram
import this, time

def zipngram(text,n=2):
  return zip(*[text.split()[i:] for i in range(n)])

text = this.s

start = time.time()
nltkngram(text.split(), n=2)
print time.time() - start

start = time.time()
zipngram(text, n=2)
print time.time() - start

[OUT]

0.000213146209717
6.50882720947e-05

在python中生成ngrams是否有更快的实现?

3 个答案:

答案 0 :(得分:8)

进行一些分析的一些尝试。我认为使用发电机可以提高速度。但与原版的略微修改相比,这种改进并不明显。但是如果你不需要同时使用完整列表,那么生成器函数应该更快。

import timeit
from itertools import tee, izip, islice

def isplit(source, sep):
    sepsize = len(sep)
    start = 0
    while True:
        idx = source.find(sep, start)
        if idx == -1:
            yield source[start:]
            return
        yield source[start:idx]
        start = idx + sepsize

def pairwise(iterable, n=2):
    return izip(*(islice(it, pos, None) for pos, it in enumerate(tee(iterable, n))))

def zipngram(text, n=2):
    return zip(*[text.split()[i:] for i in range(n)])

def zipngram2(text, n=2):
    words = text.split()
    return pairwise(words, n)


def zipngram3(text, n=2):
    words = text.split()
    return zip(*[words[i:] for i in range(n)])

def zipngram4(text, n=2):
    words = isplit(text, ' ')
    return pairwise(words, n)


s = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
s = s * 10 ** 3

res = []
for n in range(15):

    a = timeit.timeit('zipngram(s, n)', 'from __main__ import zipngram, s, n', number=100)
    b = timeit.timeit('list(zipngram2(s, n))', 'from __main__ import zipngram2, s, n', number=100)
    c = timeit.timeit('zipngram3(s, n)', 'from __main__ import zipngram3, s, n', number=100)
    d = timeit.timeit('list(zipngram4(s, n))', 'from __main__ import zipngram4, s, n', number=100)

    res.append((a, b, c, d))

a, b, c, d = zip(*res)

import matplotlib.pyplot as plt

plt.plot(a, label="zipngram")
plt.plot(b, label="zipngram2")
plt.plot(c, label="zipngram3")
plt.plot(d, label="zipngram4")
plt.legend(loc=0)
plt.show()

对于这个测试数据,zipngram2和zipngram3似乎是最快的。

enter image description here

答案 1 :(得分:2)

扩展M4rtini's Code

使用Python3.6.5,nltk == 3.3

from nltk import ngrams
def get_n_gramlist(text,n=2):        
    nngramlist=[]
    for s in ngrams(text.split(),n=n):        
        nngramlist.append(s)                
    return nngramlist

时间结果 enter image description here

答案 2 :(得分:1)

扩展M4rtini's code,我使用硬编码n=2参数创建了另外三个版本:

def bigram1(text):
    words = iter(text.split())
    last = words.next()
    for piece in words:
        yield (last, piece)
        last = piece

def bigram2(text):
    words = text.split()
    return zip(words, islice(words, 1, None))

def bigram3(text):
    words = text.split()
    return izip(words, islice(words, 1, None))

使用timeit,我得到了以下结果:

zipngram(s, 2):        3.854871988296509
list(zipngram2(s, 2)): 2.0733611583709717
zipngram3(s, 2):       2.6574149131774902
list(zipngram4(s, 2)): 4.668303966522217
list(bigram1(s)):      2.2748169898986816
bigram2(s):            1.979405164718628
list(bigram3(s)):      1.891601800918579

bigram3是我测试中最快的。硬编码和使用迭代器似乎确实有一些好处,如果它们在整个过程中使用(至少对于这个参数值)。我们看到迭代器的好处贯穿于zipngram2 zipngram3n=2之间的较大差异。

我也尝试过使用PyPy,但实际上这似乎让事情变慢了(这包括尝试在进行定时测试之前将函数调用10k次来预热JIT)。不过,我对PyPy很新,所以我可能做错了。可能使用Pyrex或Cython可以实现更快的加速。