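I know there are other questions about similar problems, but none of them covers quite what I'm looking for. I need a relatively fast way to compute the similarity of every item in a matrix to every other item. I'm testing an NLP technique to see how well it measures document similarity.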
So I have a matrix with rows like this (I currently store it as a dictionary, but I believe I can convert it to a matrix):
A = [a1, a2, ..., am]
B = [b1, b2, ..., bm]
...
N = [n1, n2, ..., nm]
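To make the dictionary-to-matrix conversion concrete, here is a minimal sketch of what I have in mind. It assumes each document is a sparse list of (feature_id, weight) pairs, as in a bag of words; the function name `documents_to_matrix`, the `num_features` parameter, and the toy data are made up for illustration and are not part of my pipeline.

```python
import numpy as np


def documents_to_matrix(documents, num_features):
    """Stack sparse (feature_id, weight) documents into one dense matrix.

    `documents` maps a category label to a list of documents. Returns
    (matrix, labels): one row per document, plus an array of category
    labels aligned with the rows.
    """
    rows, labels = [], []
    # Sort the categories so the row order is reproducible.
    for label, docs in sorted(documents.items()):
        for doc in docs:
            row = np.zeros(num_features)
            for feature_id, weight in doc:
                row[feature_id] = weight
            rows.append(row)
            labels.append(label)
    return np.vstack(rows), np.array(labels)


# Hypothetical toy data: two categories, three features.
toy_documents = {
    "sports": [[(0, 1.0), (1, 2.0)], [(0, 2.0), (1, 1.0)]],
    "politics": [[(2, 3.0)]],
}
matrix, labels = documents_to_matrix(toy_documents, num_features=3)
```

Keeping the labels in a separate array aligned with the rows means the category structure survives the conversion, so the within/outside comparison can still be done afterwards.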
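My algorithm iterates over each category. Within a category, I iterate over the elements of that category. For each element, I then iterate over every other element in the same category and find the average similarity. For each element, I also iterate over every element outside the category and find the average similarity. I keep averaging each element's averages, which gives me an "average of averages" of how similar in-category documents are to one another compared with out-of-category documents. Here is my code: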
import numpy as np
import gensim
from gensim import matutils


class SimilarityTesting:
    def __init__(self, documents, space, documentdictionary, tfidf=None, hes=False, printer=False):
        self.documents = documents                    # {category label: list of documents}
        self.space = space                            # model that maps a document to its vector
        self.documentDictionary = documentdictionary  # {str(document): document ID}
        self.tfidf = tfidf                            # optional TF-IDF model applied before self.space
        self.hes = hes                                # True: use 1 - Hellinger distance instead of cosine
        self.printer = printer                        # True: print intermediate averages
        self.pairwiseDictionary = {}                  # cache of similarities already computed
        #self.testingExecute()

    def testingExecute(self):
        # Average the per-category results over all categories.
        withinCategorySum = 0
        outsideCategorySum = 0
        i = 0
        for categoryLabel, categoryDocuments in self.documents.items():
            categoryAverage, setAverage = self.categoryComparison(categoryLabel, categoryDocuments)
            withinCategorySum += categoryAverage
            outsideCategorySum += setAverage
            i += 1
        withinAverage = withinCategorySum / i
        outsideAverage = outsideCategorySum / i
        ratio = withinAverage / outsideAverage
        print("The average similarity of documents within their category is %s" % withinAverage)
        print("The average similarity of documents not within their category is %s" % outsideAverage)
        print("The Ratio of Difference is %s" % ratio)
        print("")
        return withinAverage, outsideAverage, ratio

    def categoryComparison(self, categoryLabel, categoryDocuments):
        if self.printer:
            print("CATEGORY %s" % categoryLabel)
        categorySum = 0
        nonCategorySum = 0
        i = 0
        # Every document that belongs to some other category, flattened into one list.
        nonCategory = [x for x in self.documents.values() if x != categoryDocuments]
        flatNoncategory = [val for sublist in nonCategory for val in sublist]
        #I believe this is the best place to do parallelization
        for element in categoryDocuments:
            categorySim = self.itemPairCompare(element, categoryDocuments)
            categorySum += categorySim
            nonCategorySim = self.itemPairCompare(element, flatNoncategory)
            nonCategorySum += nonCategorySim
            i += 1
        categoryAverage = categorySum / i
        noncategoryAverage = nonCategorySum / i
        if self.printer:
            print("")
            print("AVERAGE SIMILARITY OF CATEGORY ITEM-IN-CATEGORY SIMILARITY %s" % categoryAverage)
            print("AVERAGE SIMILARITY OF CATEGORY'S ITEM-NOT-IN-CATEGORY SIMILARITY %s" % noncategoryAverage)
            print("")
            print("")
        return categoryAverage, noncategoryAverage

    def itemPairCompare(self, item, listDocuments):
        # Average similarity of `item` to every different document in listDocuments.
        total = 0
        i = 0
        for value in listDocuments:
            if item != value:
                itemID = self.documentDictionary[str(item)]
                valueID = self.documentDictionary[str(value)]
                # Tuple keys: concatenating or adding the two IDs could make distinct pairs collide.
                pair1ID = (itemID, valueID)
                pair2ID = (valueID, itemID)
                if pair1ID in self.pairwiseDictionary:
                    sim = self.pairwiseDictionary[pair1ID]
                elif pair2ID in self.pairwiseDictionary:
                    sim = self.pairwiseDictionary[pair2ID]
                else:
                    # Not cached yet: compute the similarity once and store it.
                    if self.tfidf:
                        vec1 = self.space[self.tfidf[item]]
                        vec2 = self.space[self.tfidf[value]]
                        sim = matutils.cossim(vec1, vec2)
                    elif self.hes:
                        vec1 = self.space[item]
                        vec2 = self.space[value]
                        dense1 = gensim.matutils.sparse2full(vec1, self.space.num_topics)
                        dense2 = gensim.matutils.sparse2full(vec2, self.space.num_topics)
                        hes = np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2))**2).sum())
                        sim = 1 - hes
                    else:
                        sim = matutils.cossim(self.space[item], self.space[value])
                    self.pairwiseDictionary[pair1ID] = sim
                total += sim
                i += 1
        average = total / i
        if self.printer:
            print("ITEM'S AVERAGE SIMILARITY TO ITEMS IN CATEGORY IS %s" % average)
        return average
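This code sits downstream of a larger pipeline that tokenizes documents and converts them into bags of words, so I can't provide a sample matrix to run it on. I'm including the code mainly to show where I am right now. Basically, I want to know whether I can build a matrix of vectors and then use numpy to compare every vector with every other vector faster than by iterating.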
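To illustrate the kind of vectorized computation I'm asking about, here is a rough sketch rather than a tested replacement for my class. It assumes the documents are already dense rows of a NumPy array with a parallel array of category labels, that every category has at least two documents, and that there are at least two categories. The names `cosine_similarity_matrix` and `within_outside_averages` and the toy vectors are invented for the example. One difference from my code: it skips only each document's comparison with itself, whereas my loop skips any document equal to the current one.

```python
import numpy as np


def cosine_similarity_matrix(matrix):
    """All-pairs cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # keep all-zero rows at zero instead of dividing by 0
    unit = matrix / norms            # every non-zero row now has length 1
    return unit.dot(unit.T)          # entry (i, j) is the cosine of rows i and j


def within_outside_averages(sim, labels):
    """Average of per-category averages, inside vs. outside the category."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    in_mask = same & ~np.eye(len(labels), dtype=bool)   # same category, not itself
    out_mask = ~same                                    # any other category
    # Per-document average similarity to in-category and out-of-category documents.
    in_avg = (sim * in_mask).sum(axis=1) / in_mask.sum(axis=1)
    out_avg = (sim * out_mask).sum(axis=1) / out_mask.sum(axis=1)
    # Average per category first, then across categories.
    categories = np.unique(labels)
    within = np.mean([in_avg[labels == c].mean() for c in categories])
    outside = np.mean([out_avg[labels == c].mean() for c in categories])
    return within, outside, within / outside


# Hypothetical toy data: two categories with two documents each.
vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 2.0]])
labels = np.array(["a", "a", "b", "b"])
sim = cosine_similarity_matrix(vectors)
within, outside, ratio = within_outside_averages(sim, labels)
```

The similarity matrix here is computed once with a single matrix product, so a separate cache of already-seen pairs would no longer be needed. The cost is holding an N x N array in memory.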
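I've managed to save some time by keeping a dictionary of the pairs that have already been computed, to avoid repeating calculations. However, this still takes a long time to run (O(NM^2), I believe), and I'm trying to figure out ways to optimize or parallelize it. Is this something NumPy excels at? Also, I've read that multicore processing in Python is somewhat difficult. Is this a problem that would be hard to solve with multiprocessing? Does anyone have suggestions for doing this in a more optimized way? Thanks.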
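For the parallelization part, one option I can imagine is splitting the rows into blocks and computing each block's similarities on a thread pool. This is only a sketch under assumptions: it uses Python 3's `concurrent.futures`, it assumes the rows are already normalized to unit length, and it relies on my understanding that NumPy releases the GIL inside large matrix products. The function name `blocked_similarity`, the block size, the worker count, and the random data are all made up, and I have not measured whether this beats a single `dot` call, since the underlying BLAS library may already use several cores.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def blocked_similarity(unit, block_size=256, workers=4):
    """Cosine similarity of unit-length rows, computed block by block on threads."""
    starts = range(0, unit.shape[0], block_size)

    def one_block(start):
        # Similarities of one block of rows against all rows.
        return unit[start:start + block_size].dot(unit.T)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as `starts`.
        blocks = list(pool.map(one_block, starts))
    return np.vstack(blocks)


# Hypothetical random data standing in for real document vectors.
rng = np.random.RandomState(0)
data = rng.rand(1000, 50)
unit = data / np.linalg.norm(data, axis=1, keepdims=True)
sim_blocked = blocked_similarity(unit, block_size=128, workers=4)
```

Working in row blocks also bounds how much is computed at once, which might matter if the full N x N result turns out to be too large to build in one step.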