Faster storage of an NLTK FreqDist?

Asked: 2016-03-01 03:06:54

Tags: python json serialization nltk pickle

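I'm trying to speed up my application, and I've found that the simple little function below (compute_ave_freq) is actually one of the biggest time sinks. The culprit seems to be the line where it unpickles an NLTK FreqDist; that takes an obscene amount of time.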

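Of course, even that obscene amount of time is less than half of what it takes to recompute the FreqDist. Is there a better way to save an NLTK FreqDist object? I tried serializing it to JSON, but that saves it as a plain dictionary, which loses much of the NLTK functionality I need.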

Here is the code:

def compute_ave_freq(word_forms):    
    fd = pickle.load(open("data/fd.txt", 'rb'))
    total_freq = 0
    for form in word_forms:
        freq = fd.freq(form)
        total_freq += freq
    try:
        ave_freq = total_freq/len(word_forms)
    except ZeroDivisionError:
        ave_freq = 0
    return ave_freq

Here is the LineProfiler output:

Total time: 0.197121 s
File: /home/username/development/appname/filename.py
Function: compute_ave_freq at line 25
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
25                                           def compute_ave_freq(word_forms, debug=False):
26                                               # word_forms is a list of morphological variations of a word, such as
27                                               # ['كتبوا', 'كتبو', 'كتبنا', 'كتبت']
28                                           
29         1        78580  78580.0     79.1      fd = pickle.load(open("data/fd.txt", 'rb'))
30         1            3      3.0      0.0      total_freq = 0
31         5           10      2.0      0.0      for form in word_forms:
32         4        20676   5169.0     20.8          freq = fd.freq(form)
33         4            9      2.2      0.0          if debug==True:
34                                                       print(form, '\n', freq)
35         4            6      1.5      0.0          total_freq += freq
36         1            1      1.0      0.0      try:
37         1            3      3.0      0.0          ave_freq = total_freq/len(word_forms)
38                                               except ZeroDivisionError:
39                                                   ave_freq = 0
40         1            1      1.0      0.0      return ave_freq

Thanks!

1 Answer:

Answer 0: (score: 1)

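As suggested in the comments, moving the fd variable outside the function should solve the problem, because the pickle is then loaded once instead of on every call: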

import pickle

# Load the FreqDist once at import time, not on every call.
with open("data/fd.txt", 'rb') as f:
    fd = pickle.load(f)

def compute_ave_freq(word_forms):
    total_freq = 0
    for form in word_forms:
        freq = fd.freq(form)
        total_freq += freq
    try:
        ave_freq = total_freq/len(word_forms)
    except ZeroDivisionError:
        ave_freq = 0
    return ave_freq
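If loading at import time is inconvenient (for example, if the module is imported in places that never call the function), a lazily cached loader is one possible variation on the same idea. This is only a sketch: `load_fd` and the `path` parameter are hypothetical names that do not appear in the original code, and it assumes the pickled object exposes `freq()` the way a FreqDist does.

```python
import pickle
from functools import lru_cache


@lru_cache(maxsize=None)
def load_fd(path):
    """Unpickle the frequency distribution once per path, then reuse it.

    `load_fd` is a hypothetical helper, not part of NLTK.
    """
    with open(path, "rb") as f:  # the context manager closes the file
        return pickle.load(f)


def compute_ave_freq(word_forms, path="data/fd.txt"):
    """Average relative frequency of the given word forms."""
    if not word_forms:
        return 0
    fd = load_fd(path)  # reads from disk only on the first call for this path
    return sum(fd.freq(form) for form in word_forms) / len(word_forms)
```

The first call pays the unpickling cost; later calls with the same path get the cached object back.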

However, since you're writing a sum-and-average function, here is a simpler implementation:

fd = pickle.load(open("data/fd.txt", 'rb'))

def compute_ave_freq(word_forms):
    try:
        return sum([fd.freq(form) for form in word_forms]) / len(word_forms)
    except ZeroDivisionError:
        return 0

Or:

fd = pickle.load(open("data/fd.txt", 'rb'))

def compute_ave_freq(word_forms):
    l = len(word_forms)
    if l > 0:
        return sum([fd.freq(form) for form in word_forms]) / l
    else:
        return 0

Or even simpler:

fd = pickle.load(open("data/fd.txt", 'rb'))

def compute_ave_freq(word_forms):
    l = len(word_forms)
    return sum([fd.freq(form) for form in word_forms]) / l if l > 0 else 0

Or with a lambda:

fd = pickle.load(open("data/fd.txt", 'rb'))
compute_ave_freq = lambda x: sum(fd.freq(i) for i in x)/len(x)
ave_freq = compute_ave_freq(word_forms) if len(word_forms) > 0 else 0

See also EAFP (easier to ask forgiveness than permission) and LBYL (look before you leap).
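To make the two terms concrete, here is a minimal sketch of both styles applied to a plain average. The function names `mean_eafp` and `mean_lbyl` are made up for illustration; they mirror the try/except and `if l > 0` variants above but work on ordinary numbers, so no FreqDist is needed.

```python
def mean_eafp(values):
    """EAFP: attempt the division and handle the failure afterwards."""
    try:
        return sum(values) / len(values)
    except ZeroDivisionError:
        return 0


def mean_lbyl(values):
    """LBYL: check the precondition before dividing."""
    if len(values) > 0:
        return sum(values) / len(values)
    return 0
```

Both return the same results; which one reads better is largely a matter of taste and of how often the empty case is expected.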