NLTK has a nice word-to-word similarity function that measures similarity by how close the two terms are to a common hypernym. It does not apply when the two terms differ in POS tag, but it is still useful.
However, I found it far too slow... about 10 times slower than plain term matching. Is there any way to make the NLTK similarity function faster?
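For reference, this is the call I mean: Wu-Palmer similarity on the first synset of each word. A minimal example (the printed score is approximate):

from nltk.corpus import wordnet

# Wu-Palmer similarity scores two senses by how close they are to their
# most specific common hypernym in the WordNet taxonomy
dog = wordnet.synsets('dog')[0]   # first (most common) sense
cat = wordnet.synsets('cat')[0]
print(dog.wup_similarity(cat))    # roughly 0.86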
I tested it with the following code:
from nltk import stem, RegexpStemmer
from nltk.corpus import wordnet, stopwords
from nltk.tag import pos_tag
import time

file1 = open('./tester.csv', 'r')

def similarityCal(word1, word2):
    # Wu-Palmer similarity between the first synsets of the two words
    synset1 = wordnet.synsets(word1)
    synset2 = wordnet.synsets(word2)
    if len(synset1) != 0 and len(synset2) != 0:
        wordFromList1 = synset1[0]
        wordFromList2 = synset2[0]
        return wordFromList1.wup_similarity(wordFromList2)
    else:
        return 0

start_time = time.time()
file1lines = file1.readlines()
stopwords = stopwords.words('english')
previousLine = ""
currentLine = ""
cntOri = 0   # count of exact term matches
cntExp = 0   # count of similarity matches

for line1 in file1lines:
    currentLine = line1.lower().strip()
    if previousLine == "":
        previousLine = currentLine
        continue
    for tag1 in pos_tag(currentLine.split(" ")):
        tmpStr1 = tag1[0]
        if tmpStr1 not in stopwords and len(tmpStr1) > 1:
            if tmpStr1 in previousLine:
                print("termMatching word", tmpStr1)
                cntOri = cntOri + 1
            for tag2 in pos_tag(previousLine.split(" ")):
                tmpStr2 = tag2[0]
                # only compare noun with noun, or verb with verb
                if (tag1[1].startswith("NN") and tag2[1].startswith("NN")) or \
                   (tag1[1].startswith("VB") and tag2[1].startswith("VB")):
                    value = similarityCal(tmpStr1, tmpStr2)
                    if type(value) is float and value > 0.8:
                        print(tmpStr1, " similar to ", tmpStr2, " ", value)
                        cntExp = cntExp + 1
    previousLine = currentLine

end_time = time.time()
print("time taken : ", end_time - start_time, " // ", cntOri, " | ", cntExp)
file1.close()
To compare performance, I simply commented out the call to the similarity function.
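Would caching repeated lookups be the right direction? A rough sketch of what I mean, using functools.lru_cache to memoize both the synset lookup and the pairwise score (this assumes the same words recur across lines; I have not benchmarked it):

from functools import lru_cache
from nltk.corpus import wordnet

@lru_cache(maxsize=None)
def first_synset(word):
    # each distinct word hits WordNet only once
    synsets = wordnet.synsets(word)
    return synsets[0] if synsets else None

@lru_cache(maxsize=None)
def similarityCal(word1, word2):
    # each distinct (word1, word2) pair is scored only once
    s1 = first_synset(word1)
    s2 = first_synset(word2)
    if s1 is None or s2 is None:
        return 0
    return s1.wup_similarity(s2)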
I used the sample data from this site: https://www.briandunning.com/sample-data/
Any ideas?