I am trying to match the sentences with the highest cosine similarity in a dataset of 10,000 questions that were asked on a forum. I have already written the algorithm and I am seeing good results. However, the computation takes a long time, and that is for a single sentence. Is there an efficient way to match all the sentences against each other? A plain for loop does not seem like a sufficient approach here.
import re
import math
from collections import Counter
import csv
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
#nltk.download("punkt")
#nltk.download("stopwords")

stop_words = set(stopwords.words('english'))

# The second column (index 1) of each CSV holds the question text.
reader = csv.reader(open("/Users/stefan/dev.csv"))
dev = []
for line in reader:
    dev.append(line[1])

reader2 = csv.reader(open("/Users/stefan/test.csv"))
test = []
for line2 in reader2:
    test.append(line2[1])
def get_cosine(vec1, vec2):
    # Dot product over the shared vocabulary, divided by the vector norms.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    sum1 = sum(v**2 for v in vec1.values())
    sum2 = sum(v**2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    return numerator / denominator
def text_to_vector(text):
    # Tokenize and drop English stop words before counting term frequencies.
    word_tokens = word_tokenize(text)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    return Counter(filtered_sentence)
def get_result(content_a, content_b):
    vector1 = text_to_vector(content_a)
    vector2 = text_to_vector(content_b)
    return get_cosine(vector1, vector2)
print("Have a cosine similarity of: ", get_result(dev[1], test[1]))
The result here will be: Have a cosine similarity of: 0.3779644730092272. However, this only compares one pair of sentences. When I tried running it on just the first 5 sentences, I already had to wait 10 minutes. Since I want to match all 10,000 sentences against each other, I am looking for a more efficient way to run the algorithm. I have found some information on multiprocessing with Pool, but I am not sure how to implement it here (my rough attempt is sketched at the end of this post). What I have now:
finalresult = []
for countnum in range(1, 5):
    testchecker = []
    for i in range(1, len(test)):
        if i != countnum:
            testresult = get_result(test[countnum], test[i])
            testchecker.append([testresult, countnum, test[countnum], i - 1])
    resulttest = max(testchecker, key=lambda x: x[0])
    finalresult.append([resulttest[1] - 1, resulttest[2], resulttest[3]])
print(finalresult)
The result is a list containing, for each question, its ID, its text, and the ID of the question it is most similar to:
[[0, 'What are the hottest IT startup companies in Mumbai?', 8808], [1, 'How often do you drink coffee (-based) drinks?', 2103], [2, 'Which contries provide financial help to India?', 1386], [3, 'What are some interesting facts about the NSG?', 3472]]
Is there a way to make this more computationally efficient?
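This is my rough idea for parallelising the outer loop with multiprocessing.Pool, based on what I read in the documentation. The worker function find_best_match and the process count are my own guesses, so I am not sure this is the right way to structure it:

from multiprocessing import Pool

def find_best_match(countnum):
    # Compare one question against every other question and keep the best score.
    testchecker = []
    for i in range(1, len(test)):
        if i != countnum:
            testresult = get_result(test[countnum], test[i])
            testchecker.append([testresult, countnum, test[countnum], i - 1])
    resulttest = max(testchecker, key=lambda x: x[0])
    return [resulttest[1] - 1, resulttest[2], resulttest[3]]

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # guessed worker count; adjust to the machine
        finalresult = pool.map(find_best_match, range(1, 5))
    print(finalresult)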
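I also noticed that get_result re-tokenizes both sentences on every single comparison, so each question gets tokenized thousands of times. Would precomputing the vectors once, as in this sketch, be a sensible first step, or is there a fundamentally better approach?

# Tokenize every question once up front instead of on every comparison.
test_vectors = [text_to_vector(t) for t in test]

finalresult = []
for countnum in range(1, 5):
    testchecker = []
    for i in range(1, len(test)):
        if i != countnum:
            score = get_cosine(test_vectors[countnum], test_vectors[i])
            testchecker.append([score, countnum, test[countnum], i - 1])
    resulttest = max(testchecker, key=lambda x: x[0])
    finalresult.append([resulttest[1] - 1, resulttest[2], resulttest[3]])
print(finalresult)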