I have a table T1:
id   | Value1 | Value2 | Value3 | Compared | Related
---------------------------------------------------
af02 | AAA    | BBB    | CCC    | 1        | 1
ff02 | ABA    | BBB    | CAC    | 1        | af02
h2f0 | AAB    | BBA    | CCA    | 0        | 0
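The id is not auto-incremented, and Value1 to Value3 are text. I need to compare Value2 of every uncompared row (Compared = 0) with the Value2 of all other rows to see whether the text is the same. If it is similar, I need to add the id of the similar row to the Related column; if not, I need to put 1 in the Related column. I need to do this with Python and MySQL.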
Thanks
Answer 0 (score: 0)
As for the MySQL side, I'm not sure where to start.
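That said, one possible starting point for reading and writing the rows might look like the sketch below. It assumes a DB-API cursor that uses the `%s` parameter style (as PyMySQL and mysql-connector-python do) and the T1 columns from the question. The function names `fetch_rows` and `save_related` are hypothetical, not part of any library.

```python
def fetch_rows(cursor):
    """Return (uncompared, all_rows) as lists of (id, Value2) tuples.

    `cursor` is assumed to be a DB-API cursor on a connection to the
    database that holds T1.
    """
    cursor.execute("SELECT id, Value2 FROM T1")
    all_rows = list(cursor.fetchall())
    cursor.execute("SELECT id, Value2 FROM T1 WHERE Compared = 0")
    uncompared = list(cursor.fetchall())
    return uncompared, all_rows


def save_related(cursor, row_id, related):
    """Store the Related value for one row and mark the row as compared."""
    cursor.execute(
        "UPDATE T1 SET Related = %s, Compared = 1 WHERE id = %s",
        (related, row_id),
    )
```

The changes still have to be committed on the connection afterwards, since most MySQL drivers do not autocommit by default.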
As for the comparison, I would use cosine similarity on the values read from the db, something like this:
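    import numpy as np
    from numpy import linalg as LA
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    # item['Value2'] is assumed to hold the Value2 strings read from the db,
    # and i is the index of the row currently being compared
    train_set = [item['Value2'][i]]
    test_set = [item['Value2'][i + 1]]
    stopWords = stopwords.words('english')
    vectorizer = CountVectorizer(stop_words=stopWords)
    transformer = TfidfTransformer()  # not used below; raw counts are compared
    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
    testVectorizerArray = vectorizer.transform(test_set).toarray()
    # cosine similarity of two count vectors, rounded to 3 decimals
    cx = lambda a, b: round(np.inner(a, b) / (LA.norm(a) * LA.norm(b)), 3)
    for vector in trainVectorizerArray:
        for testV in testVectorizerArray:
            cosine = cx(vector, testV)
            print(cosine)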
I would then use the cosine value to decide whether two rows are similar and to link them.
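As a rough sketch of how the cosine value could drive the linking, the example below computes the value to store in Related for one row. To stay self-contained it swaps CountVectorizer for plain word counts from `collections.Counter`. The 0.8 threshold, the comma-joined format for several matching ids, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    # An empty text has a zero norm; treat it as not similar to anything
    return dot / norm if norm else 0.0


def related_value(row_id, text, all_rows, threshold=0.8):
    """Return the ids of similar rows joined by commas, or "1" if none match.

    `all_rows` is a list of (id, Value2) tuples; the row itself is skipped.
    """
    similar = [
        other_id
        for other_id, other_text in all_rows
        if other_id != row_id and cosine(text, other_text) >= threshold
    ]
    return ",".join(similar) if similar else "1"
```

With the sample rows from the question, `related_value("ff02", "BBB", rows)` links ff02 to af02, while h2f0 (Value2 "BBA") has no match and gets "1".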