Question

I'm using the Python fuzzywuzzy library for address string matching, and I parallelize it with the multiprocessing module. This is my current code:

import pandas as pd
from fuzzywuzzy import process
from multiprocessing import Pool
test = pd.DataFrame({'Address1':['123 Cheese Way','234 Cookie Place','345 Pizza Drive','456 Pretzel Junction'],'city':['X','U','X','U']}) 
test2 = pd.DataFrame({'Address1':['123 chese wy','234 kookie Pl','345 Pizzza DR','456 Pretzel Junktion'],'city':['X','U','Z','Y'] , 'ID' : ['1','3','4','8']})

test=test.assign(dataset = 'test')
test2=test2.assign(dataset = 'test2')

newdf=pd.concat([test2,test],keys = ['test2','test'])
gpd=newdf.groupby('city')

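def my_func(mygrp):
    # A city present in only one dataset has nothing to match against.
    datasets = mygrp.index.get_level_values(0)
    if 'test' not in datasets or 'test2' not in datasets:
        return None  # pd.concat silently drops None entries
    test_data = mygrp.loc['test'].copy()  # copy so the new column is not set on a slice
    test2_data = mygrp.loc['test2'].loc[:, ['ID', 'Address1']].set_index('ID')
    # Candidate addresses keyed by ID: {ID: Address1}
    other = test2_data['Address1'].to_dict()
    # With a dict of choices, process.extract yields (address, score, ID) tuples
    test_data['applied'] = test_data['Address1'].apply(lambda x: process.extract(x, other, limit=1))
    return test_data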

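if __name__ == '__main__':
    # The guard keeps worker processes from re-running the pool setup on import.
    with Pool(processes=2) as mypool:
        ret_list = mypool.imap(my_func, (group for name, group in gpd))
        result = pd.concat(ret_list, axis=0)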
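
To make the per-group matching step concrete, here is a minimal, self-contained sketch of picking the best candidate from an ID-to-address dict. It uses difflib from the standard library as a stand-in for fuzzywuzzy's process.extract, purely so it runs without third-party packages; the scorer is different, so the scores will not equal fuzzywuzzy's. The name best_match is hypothetical, and the sample dict ignores the city grouping for brevity.

```python
from difflib import SequenceMatcher


def best_match(query, choices):
    """Return (address, score, key) for the choice most similar to query.

    choices maps an ID to an address string. The score is a 0-100 integer
    derived from difflib's ratio, which is NOT the scorer fuzzywuzzy uses.
    Returns None when choices is empty.
    """
    best = None
    for key, candidate in choices.items():
        ratio = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
        score = round(100 * ratio)
        # Keep the highest-scoring candidate seen so far.
        if best is None or score > best[1]:
            best = (candidate, score, key)
    return best


# Candidate addresses from test2 in the question, keyed by ID.
choices = {
    '1': '123 chese wy',
    '3': '234 kookie Pl',
    '4': '345 Pizzza DR',
    '8': '456 Pretzel Junktion',
}

if __name__ == '__main__':
    for address in ['123 Cheese Way', '345 Pizza Drive']:
        print(address, '->', best_match(address, choices))
```

The tuple order (address, score, ID) mirrors what process.extract returns when given a dict of choices, so the sketch slots into the same place as the lambda in my_func.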

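This works fine, including on a large dataset (test and test2 each have more than 3 million records). Until now it has run on a more powerful EC2 instance with 64 GB of memory, but we are migrating to a smaller instance with 32 GB, so I would like to rewrite it with multithreading if possible. From what I have read so far, multithreading does not seem viable for CPU-bound tasks in Python because of the GIL. Any suggestions? Are there other possible solutions to speed up this task?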
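
The GIL concern can be checked with a small experiment. This is only a sketch under stated assumptions: cpu_bound is a hypothetical pure-Python workload, not the fuzzy matching itself, and run is a hypothetical helper. On a standard CPython build, the thread version of such a loop usually shows little or no speedup over serial execution, while the process version can use several cores; actual timings depend on the machine, and a matcher backed by a C extension may behave differently.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound(n):
    """Pure-Python loop: sum of squares below n (holds the GIL while running)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run(executor_cls, jobs, workers=2):
    """Map cpu_bound over jobs with the given executor class, preserving order."""
    with executor_cls(max_workers=workers) as executor:
        return list(executor.map(cpu_bound, jobs))


if __name__ == '__main__':
    jobs = [2_000_000] * 4
    for executor_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        results = run(executor_cls, jobs)
        elapsed = time.perf_counter() - start
        print(f'{executor_cls.__name__}: {elapsed:.2f}s for {len(results)} jobs')
```

Both executors return the same results; only the wall-clock time differs, which is the part worth measuring before committing to a rewrite.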

Or could word2vec be used for this string matching?

Using multithreading in Python

0 answers: