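For some IR (information retrieval) purposes, I want to extract some text snippets, and I would like to remove the stop words before analyzing them. To do that, I created a txt
file of stop words and then used the following code to try to remove those useless words: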
private static void stopWordRemowal() throws FileNotFoundException, IOException {
    Set<String> stopWords = new LinkedHashSet<String>();
    BufferedReader br = new BufferedReader(new FileReader("StopWord.txt"));
    for (String line; (line = br.readLine()) != null;)
        stopWords.add(line.trim());
    BufferedReader br2 = new BufferedReader(new FileReader("text"));
    FileOutputStream theNewWords = new FileOutputStream(temp);
    for (String readReady; (readReady = br2.readLine()) != null;)
    {
        StringTokenizer tokenizer = new StringTokenizer(readReady);
        String temp = tokenizer.nextToken();
        if (!stopWords.equals(temp))
        {
            theNewWords.write(temp.getBytes());
            theNewWords.write(System.getProperty("line.separator").getBytes());
        }
    }
}
But it does not actually work. Consider the following sample text snippet:
Text summarization is the process of extracting salient information from the source text and to present that
information to the user in the form of summary
The output is as follows:
Text
summarization
is
the
process
of
extracting
salient
information
from
the
source
text
and
to
present
that
information
to
the
user
in
the
form
of
summary
It has almost no effect: none of the stop words are removed. But I don't know why.
Answer 0 (score: 3)
You should use the Set's contains method rather than equals:
if(!stopWords.contains(temp)) // does the set contain the string temp?
instead of
if(!stopWords.equals(temp)) // is the set equal to a string? never true, so the condition always passes
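To show the contains check in context, here is a minimal sketch rather than a drop-in replacement for the code in the question. The class name StopWordFilter and the method removeStopWords are hypothetical, and two things go beyond what the question states: the loop over hasMoreTokens(), so that every token on a line is checked, and the lower-casing of each token before the lookup, which assumes the stop-word file holds lower-case entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;

public class StopWordFilter {

    // Returns the whitespace-separated tokens of text that are not stop words.
    // Assumption: stopWords holds lower-case entries, so each token is
    // lower-cased for the lookup only; the original casing is kept in the result.
    public static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            // contains() asks whether the token is a member of the set;
            // equals() would compare the whole Set to a String and always be false.
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative stop words; in the question they come from StopWord.txt.
        Set<String> stopWords = new HashSet<>(
                Arrays.asList("is", "the", "of", "from", "and", "to", "that", "in"));
        String text = "Text summarization is the process of extracting salient information";
        // prints [Text, summarization, process, extracting, salient, information]
        System.out.println(removeStopWords(text, stopWords));
    }
}
```

Since the set is only used for membership tests, a HashSet is enough here; the LinkedHashSet from the question would also work, it just additionally remembers insertion order.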