Indri UI does not build the index

Time: 2014-11-01 10:20:10

Tags: information-retrieval lemur indri

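I am trying to build an index with the Indri UI. I created a parameter file and a stopword list for the build. When I click "Build Index", the UI keeps working for a long time and the index is never built.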


The UI then hangs.


This is my input.txt file:

<DOC>
<DOCNO>
@switcheery
</DOCNO>
<TEXT>
Lol?"@elsidi01: "@switcheery: God bless that man that loves to see me happy......"#I"
</TEXT>
<DOCNO>
@Roseefly
</DOCNO>
<TEXT>
42% of Irish People have a Medical Card/Doctor Only Card. ##I have to admit we are a great little country #budget15 #healthcare
</TEXT>
<DOCNO>
@FammySaulkner
</DOCNO>
<TEXT>
@dthompsonRTS11 @Kirkpatrick_29 gosh dev you read my mind #I??crossfit
</TEXT>
<DOCNO>
@codesilence
</DOCNO>
<TEXT>
data mine the heart..for ??    #nsa  #i
</TEXT>
<DOCNO>
@ulidovmj
</DOCNO>
<TEXT>
Now That's What I Call Club Hits 2014: http://t.co/kd2xE5GZhq #nowalbum #album #ukcharts #uscharts #trending #i... http://t.co/tGe9wH6M0e
</TEXT>
<DOCNO>
@ulidovmj
</DOCNO>
<TEXT>
Now That's What I Call Club Hits 2014: http://t.co/kd2xE5GZhq #nowalbum #album #ukcharts #uscharts #trending #i... http://t.co/BmMMpLHcVA
</TEXT>
<DOCNO>
@ulidovmj
</DOCNO>
<TEXT>
Now That's What I Call Club Hits 2014: http://t.co/kd2xE5GZhq #nowalbum #album #ukcharts #uscharts #trending #i... http://t.co/GyuzOVA68T
</TEXT>
<DOCNO>
@ulidovmj
</DOCNO>
<TEXT>
Now That's What I Call Club Hits 2014: http://t.co/kd2xE5GZhq #nowalbum #album #ukcharts #uscharts #trending #i... http://t.co/sCw5U1DXMy
</TEXT>
<DOCNO>
@ulidovmj
</DOCNO>
<TEXT>
Now That's What I Call Club Hits 2014: http://t.co/kd2xE5GZhq #nowalbum #album #ukcharts #uscharts #trending #i... http://t.co/JwhqJoSN1T
</TEXT>
<DOCNO>
@SandySchmitz3
</DOCNO>
<TEXT>
Having kids is the biggest leap of faith a person can make. 2 create new lives & hope they spread goodness throughout the world. #I WISH
</TEXT>
<DOCNO>
@my_15minutes
</DOCNO>
<TEXT>
wubba lubba dub dub means I'm in great pain, please help me by winning the #I'dbemortyfied contest on @TheMarySue
</TEXT>
<DOCNO>
@darren1966h
</DOCNO>
<TEXT>
I managed to finish the Cheshire welcomes you! assignment! Try it for yourself! http://t.co/NYCrn7DQTu #GameInsight #iPad #i...
</TEXT>
<DOCNO>
@GomitasYnutella
</DOCNO>
<TEXT>
Set de fotos: dee-lirious: #i regret every day of my life i didn’t love you http://t.co/Z48py9uOOC
</TEXT>
<DOCNO>
@PernelleBdt
</DOCNO>
<TEXT>
"Un seul être vous manque et tout est dépeuplé." 
Ma plus belle étoile, mon plus beau souvenir.. 3 ans déjà.. #I #14102011  #memories ??
</TEXT>
<DOCNO>
@news8martha
</DOCNO>
<TEXT>
The 2.7 inches of rain that's fallen in La Crosse would translate to 27 inches of snow!
#I'll top complaining now!
</TEXT>
</DOC>

This is my stopwords.txt:

<parameters>
<stopper> 
<word>happy</word>
<word>wondeful</word>
<word>sad</word>
<word>cute</word>
</stopper>
</parameters>

Am I missing something? Please help, I am new to IR. I don't understand parameter files: I created one, but I am not sure where it is used.

1 answer:

Answer 0 (score: 0)

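For the stopword list, what I did was simply write each word on its own line, without any tags. Also, I think the correct way to use the TRECTEXT format is to put each document inside its own <DOC></DOC> element and place that document's <DOCNO> and <TEXT> elements inside it. For example: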

<DOC>
<DOCNO>
@switcheery
</DOCNO>
<TEXT>
Lol?"@elsidi01: "@switcheery: God bless that man that loves to see me happy......"#I"
</TEXT>
</DOC>
<DOC>
<DOCNO>
@Roseefly
</DOCNO>
<TEXT>
42% of Irish People have a Medical Card/Doctor Only Card. ##I have to admit we are a great little country #budget15 #healthcare
</TEXT>
</DOC>
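
If the input file is large, the two changes above can be applied with a script instead of by hand. The following is only a minimal sketch under some assumptions: it assumes every <DOCNO> element is immediately followed by its <TEXT> element, as in the question's input.txt, and that the stopword file is well-formed XML with <word> elements. The function names wrap_documents and stopwords_to_lines are made up for illustration and are not part of Indri.

```python
import re
import xml.etree.ElementTree as ET

# One <DOCNO>...</DOCNO> immediately followed by its <TEXT>...</TEXT>.
# Assumption: the pairs always appear in this order, as in the question's file.
PAIR = re.compile(
    r"<DOCNO>\s*(.*?)\s*</DOCNO>\s*<TEXT>\s*(.*?)\s*</TEXT>",
    re.DOTALL,
)


def wrap_documents(raw: str) -> str:
    """Give every DOCNO/TEXT pair its own <DOC>...</DOC> element."""
    docs = []
    for docno, text in PAIR.findall(raw):
        docs.append(
            "<DOC>\n<DOCNO>" + docno + "</DOCNO>\n"
            "<TEXT>\n" + text + "\n</TEXT>\n</DOC>"
        )
    return "\n".join(docs) + "\n"


def stopwords_to_lines(xml_text: str) -> str:
    """Turn <word>...</word> entries into a plain list, one word per line."""
    root = ET.fromstring(xml_text)
    words = [w.text.strip() for w in root.iter("word") if w.text]
    return "\n".join(words) + "\n"
```

To use it, read input.txt and stopwords.txt into strings, pass them to the two functions, and write the results to new files. The text of each tweet and each DOCNO value is copied unchanged; only the surrounding tags are regrouped.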