efficient way to calculate all cosine similarities of sentences

Time: 2018-12-03 12:54:42

Tags: python performance complexity-theory similarity cosine

I am trying to match the sentences with the highest cosine similarity in a dataset consisting of 10,000 questions that were asked on a forum. I have already created the algorithm and I am seeing great results. However, the computation takes a long time, and that is only for one sentence. Is there an efficient way to match all the sentences against each other? I guess a plain for loop is not a sufficient method here.

import re
import math
from collections import Counter
import csv
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

#nltk.download("punkt")
#nltk.download("stopwords")

stop_words = set(stopwords.words('english'))
reader = csv.reader(open("/Users/stefan/dev.csv"))
dev = []
for line in reader:
    dev.append(line[1])

reader2 = csv.reader(open("/Users/stefan/test.csv"))
test = []
for line2 in reader2:
    test.append(line2[1])

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator / denominator)

def text_to_vector(text):
    word_tokens = word_tokenize(text)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    #print("the original tokens derived:", word_tokens)
    #print()
    #print("the resulted work tokens derived: ",filtered_sentence)
    return Counter(filtered_sentence)

def get_result(content_a, content_b):
    text1 = content_a
    text2 = content_b

    vector1 = text_to_vector(text1)
    vector2 = text_to_vector(text2)

    cosine_result = get_cosine(vector1, vector2)
    return cosine_result

print("Have a cosine similarity of: ", get_result(dev[1], test[1]))

The result here will be: Have a cosine similarity of: 0.3779644730092272. However, this only compares a single pair of sentences. When I tried running it on the first 5 sentences, I already had to wait 10 minutes. Considering I want to match 10,000 sentences against each other, I am looking for an efficient way to run the algorithm. I have found some information on multiprocessing with Pool, but I am not sure how to implement it here (I have put a rough sketch of what I am thinking of at the end of this question). What I have now:

finalresult = []
for countnum in range(1, 5):  # only the first few questions for now
    testchecker = []
    for i in range(1, len(test)):
        testresult = get_result(test[countnum], test[i])
        if i != countnum:  # skip comparing a question with itself
            testchecker.append([testresult, countnum, test[countnum], i-1])
    # keep the comparison with the highest cosine similarity
    resulttest = max(testchecker, key=lambda x: x[0])
    finalresult.append([resulttest[1]-1, resulttest[2], resulttest[3]])

print(finalresult)

The result looks like this: an ID, the question text, and the ID of the question it is most similar to.

[[0, 'What are the hottest IT startup companies in Mumbai?', 8808], [1, 'How often do you drink coffee (-based) drinks?', 2103], [2, 'Which contries provide financial help to India?', 1386], [3, 'What are some interesting facts about the NSG?', 3472]]

Is there a way to make this more computationally efficient?
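
For reference, here is a rough, untested sketch of the direction I am thinking about with Pool: build every Counter vector once up front (my get_result currently re-tokenizes both sentences for every single pair), and then hand the per-question comparison out to worker processes. The helper best_match and the module-level all_vectors list are just names I made up; I do not know if this is the right way to use Pool:

from multiprocessing import Pool

# Tokenize/vectorize every question once instead of inside every pair comparison.
all_vectors = [text_to_vector(sentence) for sentence in test]

def best_match(countnum):
    # Compare question countnum against every other question and return the
    # index of the one with the highest cosine similarity.
    scores = [(get_cosine(all_vectors[countnum], vec), i)
              for i, vec in enumerate(all_vectors) if i != countnum]
    best_score, best_i = max(scores)
    return [countnum, test[countnum], best_i]

if __name__ == "__main__":
    with Pool() as pool:
        finalresult = pool.map(best_match, range(len(test)))
    print(finalresult[:5])

Even without Pool, precomputing the vectors should already avoid calling word_tokenize thousands of times per question, so maybe that alone is the bigger win here.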

0 Answers:

No answers