Efficiently searching for a string/value in a huge dictionary in Python

Time: 2019-05-11 16:44:38

Tags: python-3.x multithreading optimization multiprocessing hpc

Say I have a huge dictionary, e.g. huge_dict = {'Key1': 'ABC', 'Key 2': 'DEF', 'KEY 4': 'GHI', 'KEY5': 'IJK', ..., 'KEY N': 'XYZ'}
Searching huge_dict for a value takes a lot of time, so I tried a multiprocessing approach so that the search uses different cores. I am trying the following steps:
1: split huge_dict into m smaller dictionaries
2: spawn m processes in Python and pass the search value to each of them
3: if any process finds the value, terminate all the processes (a sketch of this plan is shown after the list)
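A minimal sketch of that plan, assuming a plain substring test stands in for the fuzzy search; the names chunk_search and parallel_search are illustrative, not from the post:

import multiprocessing as mp

def chunk_search(shard, sentence, found, result):
    """Scan one shard; stop early if another worker already found a match."""
    for key, value in shard.items():
        if found.is_set():          # some other process already succeeded
            return
        if value in sentence:       # stand-in for the fuzzy search
            result[key] = value
            found.set()             # tell the other workers to stop
            return

def parallel_search(huge_dict, sentence, m=4):
    items = list(huge_dict.items())
    shards = [dict(items[i::m]) for i in range(m)]  # m balanced shards
    with mp.Manager() as mgr:
        found = mgr.Event()         # shared "stop" flag
        result = mgr.dict()         # shared place for the match
        procs = [mp.Process(target=chunk_search,
                            args=(shard, sentence, found, result))
                 for shard in shards]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return dict(result)

if __name__ == "__main__":
    huge_dict = {'Key1': 'ABC', 'Key 2': 'DEF', 'KEY 4': 'GHI', 'KEY5': 'IJK'}
    print(parallel_search(huge_dict, "A REKDEFY, CI"))  # {'Key 2': 'DEF'}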

Before all of this I load a deep learning / machine learning model whose output is huge_dict. When I try multiprocessing, the model gets loaded many times, once for each spawned process.
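The repeated loading typically happens because each child process re-imports the script under the "spawn" start method (the default on Windows); keeping the model load inside the __main__ guard makes it run only once, in the parent. load_model below is a hypothetical placeholder for the real model:

import multiprocessing as mp

def load_model():
    # hypothetical stand-in for the expensive deep-learning model load
    return {"10001": ["sentence1", "sentence2"], "4001": ["sentence3"]}

def worker(shard, sentence):
    """Search one shard; the model itself never reaches the children."""
    print('searching', len(shard), 'keys for:', sentence)

if __name__ == "__main__":    # children re-import this file but skip this block
    huge_dict = load_model()  # loaded once, in the parent only
    p = mp.Process(target=worker, args=(huge_dict, "some sentence"))
    p.start()
    p.join()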

huge_dict = {'Key1': 'ABC', 'Key 2': 'DEF', 'KEY 4': 'GHI', 'KEY5': 'IJK'}
items = list(huge_dict.items())  # items() is a view in Python 3, so materialize it first
half = len(items) // 2           # integer division; len(...)/2 is a float in Python 3
d1 = dict(items[half:])
d2 = dict(items[:half])
# Is this an efficient way to do it? What if I split into n dicts?
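For the n-dict case asked about in the comment, round-robin slicing is one way to keep the shards roughly equal in size (a sketch, not from the original post):

items = list(huge_dict.items())
n = 4                                           # number of shards / processes
shards = [dict(items[i::n]) for i in range(n)]  # n nearly equal shards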

import multiprocessing as mp

def worker(d, search_value, num):
    """Process worker function."""
    print('Worker:', num)
    print(mp.cpu_count())
    return d
# Is this the correct way to use multiprocessing?

# current time-consuming logic:
def search(d, sentence):
    for key in d:
        if d[key] in sentence:  # doing fuzzy search here
            return d[key]

d = {'key1': "ASD", 'key2': "asd", 'key3': "fds", 'key4': "gfd", 'key5': "hjk"}
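For example (illustrative calls, not in the original post):

print(search(d, "the token asd occurs in this sentence"))  # -> 'asd'
print(search(d, "none of the values occur here"))          # -> None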

The data is in the following format:

huge_dict={"10001": ["sentence1", "sentence2","sentence3","sentence4"],
       "4001": ["sentence1", "sentence2"], 
"35432": ["sentence1", "sentence2","sentence3","sentence4", ... "sentence N"],  
.....
"N":["N no of sentences"]    }

1 answer:

Answer 0: (score: 0)

I assume you want to check whether any of the huge_dict values occur as substrings (not only as whole words) in a given string.
Try whether a set.intersection of huge_dict.values() with all substrings of the given string is faster:

def sub(s):
    """ Return all substrings of a given string """
    return [s[i:j+1] for i in range(len(s)) for j in range(i,len(s))]


huge_dict = {'Key1': 'ABC' , 'Key 2' : 'DEF' ,'KEY 4' :'GHI', 'KEY5': 'IJK'}
s = "A REKDEFY, CI"

huge_values = set(huge_dict.values())
>>> print(huge_values.intersection(sub(s)))
{'DEF'}
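One caveat: sub(s) materializes all O(len(s)**2) substrings, which grows quickly for long search sentences. Since no value can match a substring longer than the longest value in huge_dict, a bounded variant keeps the candidate set much smaller (a sketch building on the answer's code; sub_bounded is an illustrative name):

def sub_bounded(s, max_len):
    """ Return all substrings of s up to length max_len """
    return {s[i:i+k] for i in range(len(s)) for k in range(1, max_len + 1)}

max_len = max(map(len, huge_values))
print(huge_values.intersection(sub_bounded(s, max_len)))  # {'DEF'}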