My process is very slow and I want to speed it up with multithreading. The goal of the process is to read a very large dataset, run an expensive computation on each row, and store the results in a dictionary. I want to use multithreading, but I don't know how. Here is my attempt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from multiprocessing.pool import ThreadPool
def findTweets(side):
    cosine_dict = {}
    for t in tweets:
        topic = [side, t]
        tfidf_vectorizer = TfidfVectorizer()
        topic_matrix = tfidf_vectorizer.fit_transform(topic)
        cosine = cosine_similarity(topic_matrix[0:1], topic_matrix[1:2])
        cosine = float(cosine)
        key = side + "&&" + t
        cosine_dict[key] = cosine
    return cosine_dict

left = [] #just some strings
for l in left:
    pool = ThreadPool(processes = 10)
    result = pool.apply_async(findTweets, (l,))
    cosine_dict_left = result.get()
This does not seem to speed things up. How can I apply multithreading here to make the process faster?
Answer 0 (score: 0)
result.get() is a blocking call, so you are only ever running one task at a time. A quick-and-dirty fix is:
left = [] #just some strings
results = []
pool = ThreadPool(processes = 10)

# submit every task first so up to 10 calls can run concurrently
for l in left:
    results.append(pool.apply_async(findTweets, (l,)))

# only then collect the results
for result in results:
    cosine_dict_left = result.get()
    #Do something with cosine_dict_left
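For completeness, here is a minimal sketch (not part of the original answer) of the same submit-then-collect idea using ThreadPool.map, assuming findTweets and the global tweets list are defined exactly as in the question. map hands every element of left to the pool and returns the per-side dictionaries once all of them have finished.

from multiprocessing.pool import ThreadPool

left = []  # just some strings, as in the question

with ThreadPool(processes=10) as pool:
    # map() blocks until all submitted calls have finished and
    # returns their results in the same order as `left`
    per_side_dicts = pool.map(findTweets, left)

# merge the per-side dictionaries into one "side&&tweet" -> cosine mapping
cosine_dict_left = {}
for d in per_side_dicts:
    cosine_dict_left.update(d)

This also keeps a single pool alive for the whole run instead of creating a new one on every loop iteration, as the original attempt did.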