TF-IDF for search queries in MATLAB

Asked: 2014-05-03 02:43:18

Tags: matlab machine-learning tf-idf

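I implemented a machine learning algorithm called MMR (Maximal Marginal Relevance). Basically, I have a query and a set of documents, and the algorithm computes a relevance score for each document I pass in with respect to the query.

Right now I am using the 20 Newsgroups dataset in tf-idf format, found here: (http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html), in the variable named fea. I'm a bit confused: I'm not sure whether my query is in tf-idf format. Both the query and the documents in my code are supposed to be in tf-idf format.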
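To make the concern concrete, here is a minimal sketch of what a tf-idf weighted query vector could look like. It assumes a raw term-count matrix is available (the toy matrix `counts` is hypothetical, not part of the dataset above) and assumes the idf variant log(N/df). The dataset may have been weighted differently, so this is an illustration, not the dataset's actual weighting.

```matlab
% Hypothetical toy term-count matrix: rows are documents, columns are terms.
counts = [2 0 1 0;
          0 1 1 0;
          1 1 0 3];
nDocs = size(counts, 1);

df  = sum(counts > 0, 1);              % document frequency of each term
idf = log(nDocs ./ max(df, 1));        % assumed idf variant: log(N/df)

queryTerms = [1 3];                    % term indices in the query, as passed to mmr3
tf = zeros(1, size(counts, 2));
tf(queryTerms) = 1 / numel(queryTerms);% same normalized term frequency as in the question

q = (tf .* idf)';                      % tf-idf weighted query, as a column vector
```

The only difference from a query that just holds `1/numel(queryTerms)` at the query positions is the multiplication by `idf`, which down-weights terms that occur in many documents.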

function [result,index] = mmr3(query,lambda,docs)
% query  - indices of the terms that make up the query
% lambda - trade-off parameter of MMR
% docs   - indices of the candidate documents

load fea1                 %document-term matrix

fea1=fea1';               %transpose so that each column is one document

queries=zeros(1,26214);   %one entry per term

queries(query)=1/(size(query,2)); %normalize and set values at appropriate places
query=queries';
A=fea1(:,docs);
%indexes of documents, 18846 different documents
filenames=docs;

selected=A(:,1);   %select first (most relevant) document, this assumes first document listed
                   %is also most relevant to the query

selectedNames=docs(1);  %name of selected document
filenames(1)=[];        %remove the first entry so filenames stays aligned with rest
rest=A(:,2:end);        %other documents go to variable rest

for i=1:5 %sort top five most relevant documents
    MMRmax=-10;
    for k=1:size(rest,2)      %loop through not yet selected documents
        max1=0;
        for j=1:size(selected,2) %loop through selected documents
            s=sim1(selected(:,j),rest(:,k));
            if s>max1         %look for most similar document from not yet selected and selected
                max1=s;       %remember highest cosine similarity
            end
        end
        MMR=lambda*(sim1(query,rest(:,k))-(1-lambda)*max1);  %calculate MMR
        if MMR>MMRmax                   %find max MMR
            MMRmax=MMR;
            result(i)=MMRmax;
            selected2=k;
        end
    end

    selected(:,i+1)=rest(:,selected2);        %select document with highest MMR
    selectedNames(i+1)=filenames(selected2);  %name of selected document
    rest(:,selected2)=[];                     %delete that document from rest
    filenames(selected2)=[];

end
index=selectedNames;
%selectedNames
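For comparison, the MMR selection rule as it is usually written is given below. The symbols are assumptions introduced here for illustration: Q is the query, R the full candidate set, S the documents already selected, and Sim_1, Sim_2 the two similarity measures (in the code above both roles are played by `sim1`, with `rest` holding R \ S and `selected` holding S).

```latex
\mathrm{MMR} = \arg\max_{D_i \in R \setminus S}
  \Bigl[ \lambda \, \mathrm{Sim}_1(D_i, Q)
         - (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \Bigr]
```

In this form λ weights only the relevance term. The MMR line in the code multiplies the whole difference by `lambda`, so it computes λ(Sim_1 − (1−λ) max Sim_2), which weights the two terms differently and may be worth checking.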

I used cosine similarity between the query and the documents:

function [sim2] = sim1(A,B)

sim2=(A'*B)/(norm(A)*norm(B));
if(isnan(sim2))
    sim2=0;
end
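As a side note, the same cosine similarity can be computed against every document in one step rather than one pair at a time. The sketch below uses hypothetical toy data (`D` and `q` are made up for illustration) with documents stored as columns, as in the code above.

```matlab
% Hypothetical toy data: 3 terms, 4 documents stored as columns.
D = [1 0 2 0;
     0 1 2 0;
     1 1 0 0];
q = [1; 0; 1];                            % query as a column vector

colNorms = sqrt(sum(D.^2, 1));            % Euclidean norm of every document column
sims = (q' * D) ./ (norm(q) * colNorms);  % cosine similarity of q to each document
sims(isnan(sims)) = 0;                    % all-zero documents get similarity 0, as in sim1
```

Since the query does not change between iterations, its similarity to each candidate could be computed once this way and reused inside the loop.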

Here are the input and output:

[result,index]=mmr3([1,2,3],0.2,[1:20])

result= 0.0012   -0.0018    -0.0040    -0.0043    -0.0080

index= 1    10    17    5     20   8

Any suggestions would be greatly appreciated.

0 answers:

No answers