Question

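For some IR purposes, I want to extract some text snippets, and before analysing them I want to remove the stop words. To do this, I created a txt file of stop words and then used the following code to try to remove those useless words: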

private static void stopWordRemowal() throws FileNotFoundException, IOException {

    Set<String> stopWords = new LinkedHashSet<String>();
    BufferedReader br = new BufferedReader(new FileReader("StopWord.txt"));
    for (String line; (line = br.readLine()) != null;)
        stopWords.add(line.trim());

    BufferedReader br2 = new BufferedReader(new FileReader("text"));
    FileOutputStream theNewWords = new FileOutputStream(temp);

    for (String readReady; (readReady = br2.readLine()) != null;) {
        StringTokenizer tokenizer = new StringTokenizer(readReady);
        String temp = tokenizer.nextToken();
        if (!stopWords.equals(temp)) {
            theNewWords.write(temp.getBytes());
            theNewWords.write(System.getProperty("line.separator").getBytes());
        }
    }
}

But it does not actually work well. Consider the following sample text snippet:

Text summarization is the process of extracting salient information from the source text and to present that 
information to the user in the form of summary

The output is as follows:

Text
summarization
is
the
process
of
extracting
salient
information
from
the
source
text
and
to
present
that
information
to
the
user
in
the
form
of
summary

So it has almost no effect, and I don't know why.

Answer 1

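You should use the Set's contains method rather than equals. That is, write

    if (!stopWords.contains(temp)) // is the string temp an element of the set?

instead of

    if (!stopWords.equals(temp)) // compares the whole set with a String, which is always false

Because equals never returns true here, the negated condition always holds and every word is written to the output.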
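To make the fix concrete, here is a minimal sketch of the filtering step with contains in place of equals. The class name StopWordFilter, the method removeStopWords, and the inline stop-word list are hypothetical and stand in for the question's file-based setup. Lower-casing each token before the lookup is also an assumption (that the stop-word file holds lower-case entries), not something the question states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;

public class StopWordFilter {

    // Returns the whitespace-separated tokens of text that are not stop words,
    // in their original order. (Hypothetical helper, not from the question.)
    static List<String> removeStopWords(Set<String> stopWords, String text) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        // Visit every token on the line, not only the first one.
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            // contains() asks whether the token is an element of the set.
            // Assumption: the stop list is lower case, so normalise the token first.
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative stop list; in the question it is read from StopWord.txt.
        Set<String> stopWords = new HashSet<>(
                Arrays.asList("is", "the", "of", "from", "and", "to", "that", "in"));
        String text = "Text summarization is the process of extracting salient information "
                + "from the source text and to present that information to the user "
                + "in the form of summary";
        for (String word : removeStopWords(stopWords, text)) {
            System.out.println(word);
        }
    }
}
```

With this illustrative stop list, the sample sentence from the question is reduced to its content words (Text, summarization, process, extracting, and so on), one per line. A HashSet is used here because lookup order does not matter; the LinkedHashSet from the question would work just as well.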
