I'm using sklearn's `CountVectorizer` (4-word n-grams) in my project: I load an already-fitted model and use it to transform text. The methods I use are `transform`, `inverse_transform`, and `vocabulary_.keys()`.

While using it, I noticed that `inverse_transform` takes a while (about 2.7 seconds per instance), out of roughly 3 seconds for the whole process. Later, I plugged the model into a class that uses many threads (about 30) so it could run alongside other models. When I measured the same step again, it could take up to 15 seconds.
My questions are:

1. Why does the process take more time inside the class?
2. Is CountVectorizer thread-safe? I couldn't find a clear answer to this.
3. If I replaced the `inverse_transform` method with something else, would that make the process in the class faster?

Thanks
EDIT

Here is the code:
for true_label, text in zip(labels, texts):
    splitted_text = text.split(" ")  # split text to remove features
    boolean_text = [False] * len(splitted_text)  # boolean list where the ith element is for the ith feature
    ngrams = list(map(lambda x: ' '.join(x), zip(splitted_text, splitted_text[1:], splitted_text[2:], splitted_text[3:])))
    doc_vocab = set(ngrams)
    non_religious_tokens = doc_vocab - set(vectorizer.vocabulary_.keys())
    print("non religious tokens:", non_religious_tokens)
    if len(splitted_text) > 3 and non_religious_tokens == set():
        print("Religious")
        print("--- prediction time: %s seconds ---" % (time.time() - start_time))
        continue
    print("******* 2nd Filter ********")
    # transform text
    vectorized_data = vectorizer.transform([text])
    terms = vectorizer.inverse_transform(vectorized_data)  # get features of text
    terms = terms[0]
    maybe_religious = not (len(terms) == 0)
    # for each feature
    for term in terms:
        splitted_term = term.split(" ")  # split term (into four pieces in our case)
        # find the 1st word's index in the text,
        # match the other 3 words,
        # and set their indices to True
        for i, word in enumerate(splitted_text):
            try:
                # find the piece of text that matches this feature and mark all its
                # words as True (so they're removed later)
                if splitted_term[0] == splitted_text[i] and splitted_term[1] == splitted_text[i+1] and \
                        splitted_term[2] == splitted_text[i+2] and splitted_term[3] == splitted_text[i+3]:
                    boolean_text[i] = True
                    boolean_text[i+1] = True
                    boolean_text[i+2] = True
                    boolean_text[i+3] = True
            except IndexError:
                pass
    non_religious_text = ""
    # store non-religious text (words that didn't match any feature)
    for i, b in enumerate(boolean_text):
        if b is False:
            non_religious_text += " " + splitted_text[i]
    non_religious_text = remove_religious_stopwords(non_religious_text)
    # if all the features are removed, it's religious!
    if maybe_religious and (non_religious_text == "" or non_religious_text is None):
        print("Religious")
    # if the number of remaining words is below the limit, it's religious
    elif maybe_religious and (len(non_religious_text.split(" ")) - 1 <= max_non_religious):
        print("Religious")
    # otherwise it's non-religious
    else:
        print("Non Religious")
    print("--- prediction time: %s seconds ---" % (time.time() - start_time))