Python Gensim: how to make WMD similarity run faster using multiprocessing

Date: 2017-05-16 12:06:17

Tags: python multithreading multiprocessing gensim

I am trying to make gensim's WMD similarity run faster. Normally, this is what the docs show. Example corpus:

    my_corpus = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

    from gensim.models import Word2Vec
    from gensim.similarities import WmdSimilarity

    my_query = 'Human and artificial intelligence software programs'
    my_tokenized_query = ['human', 'artificial', 'intelligence', 'software', 'programs']

    # model: a Word2Vec model trained on about 100,000 documents similar to my_corpus
    model = Word2Vec.load(word2vec_model)

    def init_instance(my_corpus, model, num_best):
        instance = WmdSimilarity(my_corpus, model, num_best=1)
        return instance

    instance = init_instance(my_corpus, model, 1)
    instance[my_tokenized_query]

The best matching document is "Human machine interface for lab abc computer applications", which is great.

However, the `instance[my_tokenized_query]` call above takes a very long time. So I thought of splitting the corpus into N parts, running WMD with num_best = 1 on each part, and at the end the part with the highest score should contain the most similar document.

    import operator
    import gensim
    from multiprocessing import Process, Manager

    def main(my_query, global_jobs, process_tmp):
        process_query = gensim.utils.simple_preprocess(my_query)

        def worker(num, process_query, return_dict):
            instance = init_instance(
                my_corpus[num*chunk+1:num*chunk+chunk], model, 1)
            x = instance[process_query][0][0]
            y = instance[process_query][0][1]
            return_dict[x] = y

        manager = Manager()
        return_dict = manager.dict()

        for num in range(num_workers):
            process_tmp = Process(target=worker, args=(num, process_query, return_dict))
            global_jobs.append(process_tmp)
            process_tmp.start()
        for proc in global_jobs:
            proc.join()

        return_dict = dict(return_dict)
        ind = max(return_dict.iteritems(), key=operator.itemgetter(1))[0]
        print my_corpus[ind]
        >>> "Graph minors A survey"

The problem I have is that, even though it outputs something, it does not give me a good similar document from my corpus, even though it takes the maximum similarity over all the parts.

Am I doing something wrong?

2 answers:

Answer 0 (score: 5)

> Comment: chunk is a static variable: e.g. chunk = 600 ...

If you define `chunk` statically, then you have to compute `num_workers`.

    10001 / 600 = 16.6683333333 = 17 num_workers

It's common to use no more processes than you have cores. If you have 17 cores, that's fine.

cores are static, therefore you should:

    num_workers = os.cpu_count()
    chunk = chunksize(my_corpus, num_workers)

  1. Not the same result, changed to:

    #process_query = gensim.utils.simple_preprocess(my_query)
    process_query = my_tokenized_query

  2. All worker results are indexed 0..n. Therefore, return_dict[x] can be overwritten by a later worker using the same index but a lower value. The index in return_dict is NOT the same as the index in my_corpus. Changed to:

    #return_dict[x] = y
    return_dict[(num * chunk) + x] = y

  3. Using +1 in the chunk size computation will skip the first document. I don't know how you compute chunk; consider this example:

    def chunksize(iterable, num_workers):
        c_size, extra = divmod(len(iterable), num_workers)
        if extra:
            c_size += 1
        if len(iterable) == 0:
            c_size = 0
        return c_size

    #Usage
    chunk = chunksize(my_corpus, num_workers)
    ...
    #my_corpus_chunk = my_corpus[num*chunk+1:num*chunk+chunk]
    my_corpus_chunk = my_corpus[num * chunk:(num+1) * chunk]
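To see concretely how the question's slice `my_corpus[num*chunk+1:num*chunk+chunk]` drops documents, here is a small self-contained check (plain Python; a toy list stands in for my_corpus):

```python
# Toy stand-in for my_corpus: 10 documents, chunk size 5, 2 workers.
corpus = ["doc%d" % i for i in range(10)]
chunk = 5

# Original slicing: the first document of every chunk range is skipped,
# so index 0 and index 5 never reach any worker.
orig_chunks = [corpus[num*chunk+1:num*chunk+chunk] for num in range(2)]
# -> [['doc1', 'doc2', 'doc3', 'doc4'], ['doc6', 'doc7', 'doc8', 'doc9']]

# Corrected slicing: every document appears in exactly one chunk.
fixed_chunks = [corpus[num*chunk:(num+1)*chunk] for num in range(2)]
# -> [['doc0', ..., 'doc4'], ['doc5', ..., 'doc9']]

print(orig_chunks)
print(fixed_chunks)
```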
  4. Results: 10 runs, tuple = (index from worker num=0, index from worker num=1)

    With multiprocessing, chunk=5:
        02,09:(3,8), 01,03:(3,5):
        "System and human system engineering testing of EPS"
        04,06,07:(0,8), 05,08:(0,5), 10:(0,7):
        "Human machine interface for lab abc computer applications"

    With multiprocessing, without chunking:
        01:(3,6), 02:(3,5), 05,08,10:(3,7), 07,09:(3,8):
        "System and human system engineering testing of EPS"
        03,04,06:(0,5):
        "Human machine interface for lab abc computer applications"

    Without multiprocessing, without chunking:
        01,02,03,04,06,07,08:(3,-1):
        "System and human system engineering testing of EPS"
        05,09,10:(0,-1):
        "Human machine interface for lab abc computer applications"

Tested with Python: 3.4.2
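Putting this answer's points 1-3 together, a corrected version of the question's main() can be sketched as follows. This is a minimal, runnable illustration, not the real thing: gensim is deliberately not used, and `word_overlap_score` is a toy stand-in for WmdSimilarity, so only the chunking and index bookkeeping are demonstrated.

```python
# Minimal sketch of the corrected chunked search (toy scoring, no gensim).
from multiprocessing import Process, Manager

def chunksize(iterable, num_workers):
    c_size, extra = divmod(len(iterable), num_workers)
    if extra:
        c_size += 1
    if len(iterable) == 0:
        c_size = 0
    return c_size

def word_overlap_score(doc, query_tokens):
    # Toy similarity: number of query tokens present in the document.
    words = set(doc.lower().split())
    return sum(1 for tok in query_tokens if tok in words)

def worker(num, chunk, corpus, query_tokens, return_dict):
    docs = corpus[num * chunk:(num + 1) * chunk]   # fix 3: no +1 offset
    if not docs:                                   # last chunk may be empty
        return
    local_best = max(range(len(docs)),
                     key=lambda i: word_overlap_score(docs[i], query_tokens))
    # fix 2: key by the *global* corpus index, not the chunk-local one
    return_dict[num * chunk + local_best] = word_overlap_score(
        docs[local_best], query_tokens)

def best_match(corpus, query_tokens, num_workers):
    # fix 1: query_tokens are passed in already tokenized
    chunk = chunksize(corpus, num_workers)
    manager = Manager()
    return_dict = manager.dict()
    jobs = [Process(target=worker,
                    args=(num, chunk, corpus, query_tokens, return_dict))
            for num in range(num_workers)]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    scores = dict(return_dict)
    return corpus[max(scores, key=scores.get)]
```

With the nine-document corpus from the question and the query tokens `['graph', 'minors', 'survey']`, `best_match(my_corpus, ['graph', 'minors', 'survey'], 3)` picks "Graph minors A survey", regardless of which chunk it lands in.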

Answer 1 (score: 0)

Using Python 2.7: I used threads instead of multiprocessing. In the WMD-instance creation thread, I do something like this:

    wmd_instances = []
    if wmd_instance_count > len(wmd_corpus):
        wmd_instance_count = len(wmd_corpus)
    chunk_size = int(len(wmd_corpus) / wmd_instance_count)
    for i in range(0, wmd_instance_count):
        if i == wmd_instance_count - 1:
            # Last instance takes the rest of the corpus, including any remainder.
            wmd_instance = WmdSimilarity(wmd_corpus[i*chunk_size:], wmd_model, num_results)
        else:
            wmd_instance = WmdSimilarity(wmd_corpus[i*chunk_size:(i+1)*chunk_size], wmd_model, num_results)
        wmd_instances.append(wmd_instance)
    wmd_logic.setWMDInstances(wmd_instances, chunk_size)
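The chunking above can be checked without gensim. This sketch splits a toy list the same way; note that each non-final chunk must end at `(i+1)*chunk_size` (a bare `chunk_size` end index would yield empty slices for i >= 1):

```python
def split_corpus(corpus, instance_count):
    # Equal-sized chunks; the last chunk absorbs any remainder.
    instance_count = min(instance_count, len(corpus))
    chunk_size = len(corpus) // instance_count
    chunks = []
    for i in range(instance_count):
        if i == instance_count - 1:
            chunks.append(corpus[i * chunk_size:])
        else:
            chunks.append(corpus[i * chunk_size:(i + 1) * chunk_size])
    return chunks

print(split_corpus(list(range(10)), 3))
# -> [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```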

'wmd_instance_count' is the number of threads to use for searching. I also remember the chunk size. Then, when I want to search for something, I start 'wmd_instance_count' threads that search and return the sims they found:

    def perform_query_for_job_on_instance(wmd_logic, wmd_instances, query, jobID, instance):
        wmd_instance = wmd_instances[instance]
        sims = wmd_instance[query]
        wmd_logic.set_mt_thread_result(jobID, instance, sims)

'wmd_logic' is an instance of a class that then does this:

    def set_mt_thread_result(self, jobID, instance, sims):
        res = []
        #
        # We need to scale the found ids back to our complete corpus size...
        #
        for sim in sims:
            aSim = (int(sim[0] + (instance * self.chunk_size)), sim[1])
            res.append(aSim)
        # res (the global-index sims for this jobID/instance) is then
        # collected and aggregated by the caller.

I know the code isn't pretty, but it works. It uses 'wmd_instance_count' threads to find results, I aggregate them, and then pick the top 10 or so.
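The final aggregation step is not shown in the answer; a minimal sketch (assuming each thread produced a list of (global_index, similarity) tuples, as rescaled above) could be:

```python
import heapq

def aggregate_top_n(per_thread_sims, n=10):
    # Merge the (global_index, similarity) lists from all threads and
    # keep the n entries with the highest similarity.
    merged = [sim for sims in per_thread_sims for sim in sims]
    return heapq.nlargest(n, merged, key=lambda pair: pair[1])

print(aggregate_top_n([[(0, 0.9), (3, 0.4)], [(7, 0.8), (5, 0.1)]], n=2))
# -> [(0, 0.9), (7, 0.8)]
```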

Hope this helps.