Stop word removal goes wrong

Time: 2015-04-13 05:36:21

Tags: java file stop-words

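For some IR purposes I want to extract some text snippets, and before analyzing them I want to remove the stop words. To do that, I created a txt file of stop words and then used the following code to try to remove those useless words: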

private static void stopWordRemowal() throws FileNotFoundException, IOException {

    Set<String> stopWords = new LinkedHashSet<String>();
    BufferedReader br = new BufferedReader(new FileReader("StopWord.txt"));
    for (String line; (line = br.readLine()) != null;)
        stopWords.add(line.trim());

    BufferedReader br2 = new BufferedReader(new FileReader("text"));
    FileOutputStream theNewWords = new FileOutputStream(temp);

    for (String readReady; (readReady = br2.readLine()) != null;) {
        StringTokenizer tokenizer = new StringTokenizer(readReady);
        String temp = tokenizer.nextToken();
        if (!stopWords.equals(temp)) {
            theNewWords.write(temp.getBytes());
            theNewWords.write(System.getProperty("line.separator").getBytes());
        }
    }

}

But in practice it does not work well. Consider the following sample text snippet:

Text summarization is the process of extracting salient information from the source text and to present that 
information to the user in the form of summary

The output is as follows:

Text
summarization
is
the
process
of
extracting
salient
information
from
the
source
text
and
to
present
that
information
to
the
user
in
the
form
of
summary

It has almost no effect, but I don't know why.

1 Answer:

Answer 0: (score: 3)

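You should use the Set's contains method rather than equals. stopWords.equals(temp) compares the whole set with a single String, which is never true, so the negated condition always passes and every word gets written out.

Use

 if(!stopWords.contains(temp))//does the set contain my string temp?

instead of

if(!stopWords.equals(temp))//is the set equal to a string? not possible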
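
To make the difference concrete, here is a minimal, self-contained sketch of the filtering step using contains. The class name StopWordFilter, the method removeStopWords, and the small in-memory stop-word list are assumptions made for illustration; they are not from the question, where the list is read from StopWord.txt. The sketch also loops with hasMoreTokens() so that every token on a line is checked, and it keeps the exact, case-sensitive matching of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper class; its name and method are illustrative only.
final class StopWordFilter {

    // Returns the whitespace-separated tokens of `line` that are not in `stopWords`.
    // Matching is exact and case-sensitive, as in the question's code.
    static List<String> removeStopWords(String line, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {       // visit every token on the line
            String token = tokenizer.nextToken();
            if (!stopWords.contains(token)) {     // membership test, not Set.equals(String)
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed stop-word list for the demo; the real one would come from StopWord.txt.
        Set<String> stopWords = new HashSet<>(
                Arrays.asList("is", "the", "of", "from", "and", "to", "that", "in"));
        String line = "Text summarization is the process of extracting salient information";
        // prints [Text, summarization, process, extracting, salient, information]
        System.out.println(removeStopWords(line, stopWords));
    }
}
```

If words such as "The" should also be dropped, one option is to lowercase both the stop-word list and each token before the contains check; whether that is appropriate depends on the analysis that follows.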