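I have implemented a machine learning algorithm called MMR (Maximal Marginal Relevance). Basically, I have a query and a set of documents, and the algorithm computes a relevance score for each document with respect to the query I give it.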
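For reference, the selection rule usually given for MMR can be written as below. The symbols are notation assumed here: $Q$ is the query, $R$ the set of candidate documents, $S$ the documents already selected, and $\lambda \in [0,1]$ the weight that trades relevance against redundancy.

$$
\mathrm{MMR} = \arg\max_{D_i \in R \setminus S} \left[ \lambda \, \mathrm{sim}_1(D_i, Q) - (1 - \lambda) \max_{D_j \in S} \mathrm{sim}_2(D_i, D_j) \right]
$$

Note that $\lambda$ multiplies only the relevance term, while $(1 - \lambda)$ multiplies the redundancy term.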
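Right now I am using the 20 Newsgroups dataset in tf-idf format, found here: (http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html), in the variable named fea. I am a bit confused: I am not sure whether my query is in tf-idf format, because in my code both the query and the documents are supposed to be in tf-idf format.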
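One way to put the query on the same footing as the documents might be to weight its terms by inverse document frequency instead of giving every term the same weight. The sketch below is only an illustration under assumptions: `buildTfidfQuery` is a hypothetical helper, `X` is assumed to be a terms-by-documents matrix whose nonzero entries mark term occurrences (the transpose of `fea`), and the idf is assumed to be `log(N/df)`, which may differ from the weighting used to build the dataset.

```matlab
function q = buildTfidfQuery(termIdx, X)
% Hypothetical helper: builds a tf-idf-style query vector.
% termIdx : indices of the query terms (repeated indices count as term frequency)
% X       : terms x documents matrix; a nonzero entry means the term occurs
nTerms = size(X,1);
nDocs  = size(X,2);
df  = full(sum(X~=0, 2));                   % document frequency of every term
idf = zeros(nTerms,1);
seen = df>0;
idf(seen) = log(nDocs ./ df(seen));         % assumed idf definition: log(N/df)
tf = accumarray(termIdx(:), 1, [nTerms 1]); % raw term counts in the query
q = tf .* idf;
if norm(q)>0
    q = q / norm(q);                        % scale to unit length
end
```

For the dataset above this might be called as `q = buildTfidfQuery([1,2,3], fea')`, with the result used in place of uniformly weighted query terms.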
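function [result,index] = mmr3(query,lambda,docs)
%query: indices of the query terms; lambda: relevance/diversity trade-off in [0,1]
%docs: indices of the candidate documents (columns of fea1 after the transpose)
load fea1
fea1=fea1'; %terms x documents
queries=zeros(1,size(fea1,1)); %one entry per term (26214 in this dataset)
queries(query)=1/(size(query,2)); %normalize and set values at appropriate places
query=queries';
A=fea1(:,docs);
%indexes of documents, 18846 different documents
filenames=docs;
selected=A(:,1); %select first (most relevant) document; this assumes the first document listed
%is also the most relevant to the query
selectedNames=docs(1); %name of selected document
filenames(1)=[]; %remove the selected document by position, not by its id
rest=A(:,2:end); %other documents go to variable rest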
for i=1:5 %pick five more documents, one per iteration, by highest MMR
    MMRmax=-Inf;
    for k=1:size(rest,2) %loop through not yet selected documents
        max1=0;
        for j=1:size(selected,2) %loop through selected documents
            s=sim1(selected(:,j),rest(:,k));
            if s>max1 %look for the selected document most similar to candidate k
                max1=s; %remember highest cosine similarity
            end
        end
        MMR=lambda*sim1(query,rest(:,k))-(1-lambda)*max1; %MMR = lambda*relevance - (1-lambda)*redundancy
        if MMR>MMRmax %find max MMR
            MMRmax=MMR;
            result(i)=MMRmax;
            selected2=k;
        end
    end
    selected(:,i+1)=rest(:,selected2); %select document with highest MMR
    selectedNames(i+1)=filenames(selected2); %name of selected document
    rest(:,selected2)=[]; %delete that document from rest
    filenames(selected2)=[];
end
index=selectedNames;
%selectedNames
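To check the selection logic without loading the dataset, the same greedy loop can be written against precomputed similarities. This is a rough sketch rather than a drop-in replacement: `mmrSelect` and its arguments are hypothetical names, `simQ(k)` is assumed to hold the query similarity of document `k`, and `simD(a,b)` the similarity between documents `a` and `b`. Unlike `mmr3`, it picks the first document by query similarity instead of assuming it comes first in the list.

```matlab
function [picked,score] = mmrSelect(simQ, simD, lambda, nPick)
% Hypothetical helper: greedy MMR over precomputed similarities.
% simQ  : n x 1 similarity of each document to the query
% simD  : n x n document-to-document similarity
% lambda: trade-off between relevance and diversity, in [0,1]
% nPick : number of documents to select (at most n)
n = numel(simQ);
simQ = simQ(:);
picked = zeros(1,nPick);
score  = zeros(1,nPick);
remaining = true(n,1);
for t = 1:nPick
    if t==1
        redundancy = zeros(n,1);                        % nothing selected yet
    else
        redundancy = max(simD(:,picked(1:t-1)), [], 2); % closest selected document
    end
    mmr = lambda*simQ - (1-lambda)*redundancy;
    mmr(~remaining) = -Inf;                             % never pick a document twice
    [score(t), picked(t)] = max(mmr);
    remaining(picked(t)) = false;
end
```

On a toy input where documents 1 and 2 are near-duplicates, for example `simQ = [0.9; 0.85; 0.3]` and `simD = [1 0.95 0.1; 0.95 1 0.2; 0.1 0.2 1]`, `lambda = 0.5` selects in the order 1, 3, 2, while `lambda = 1` falls back to pure relevance order 1, 2, 3.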
I use cosine similarity between the query and the documents:
function [sim2] = sim1(A,B)
%cosine similarity of two column vectors; returns 0 if either vector is all zeros
sim2=(A'*B)/(norm(A)*norm(B));
if(isnan(sim2))
    sim2=0;
end
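Calling `sim1` once per pair can get slow as the number of documents grows. A possible alternative, sketched here under the hypothetical name `cosineToAll`, computes the cosine similarity between one vector and every column of a matrix in a single product. It assumes `q` is a column vector and `D` holds one document per column.

```matlab
function s = cosineToAll(q, D)
% Hypothetical helper: cosine similarity between q and each column of D.
% q : terms x 1 vector
% D : terms x documents matrix
colNorms = sqrt(full(sum(D.^2, 1)));   % Euclidean norm of every column
denom = norm(q) * colNorms;
s = full(q' * D);                      % dot product with every column
nz = denom > 0;
s(nz)  = s(nz) ./ denom(nz);
s(~nz) = 0;                            % same convention as sim1: zero vectors give 0
s = s(:);                              % return a column, one entry per document
```

The zero-norm guard plays the role of the `isnan` check in `sim1`, so an empty query or an empty document yields a similarity of 0 rather than NaN.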
Here are the input and output:
[result,index]=mmr3([1,2,3],0.2,[1:20])
result= 0.0012 -0.0018 -0.0040 -0.0043 -0.0080
index= 1 10 17 5 20 8
Any suggestions would be greatly appreciated.