Question

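Is there a way to remove stop words before using a Java-based document classifier (such as OpenNLP)? Or, if you do it yourself in Java, what would be the most efficient way, given that string comparisons are inefficient? Note also that each document is not very large, about 100 words on average, but the number of documents is assumed to be large.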

E.g., 
// Populate the stop words into a list
List<String> stopWordsList = new ArrayList<>();

// Iterate through a list of documents
String currentDoc = getCurrentDoc();

String[] wordsArray = currentDoc.split(" ");

for (String word : wordsArray) {
    if (stopWordsList.contains(word)) {
        // Drop it
    }
}

Answer 1

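Your technique is fine. However, you should make stopWordsList a Set rather than a List, so that each lookup takes constant time instead of linear time. In other words, you don't want to scan the whole stopWordsList to see whether word is in it; you want to check directly whether it is there.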
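A minimal sketch of the Set-based version, to make the suggestion concrete. The class name `StopWordFilter`, the four stop words, the whitespace split, and the lowercasing step are all assumptions for illustration, not part of the question's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {
    // Illustrative stop-word list (an assumption); load a real list in practice.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "the", "for", "is"));

    // Returns the words of doc, in order, that are not stop words.
    static List<String> removeStopWords(String doc) {
        List<String> kept = new ArrayList<>();
        for (String word : doc.split("\\s+")) {
            // HashSet.contains is a hash lookup: constant time on average,
            // instead of a scan over the whole list.
            if (!word.isEmpty()
                    && !STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String doc = "This is a sample sentence for testing stop word deletion";
        System.out.println(String.join(" ", removeStopWords(doc)));
    }
}
```

Because the set is built once and shared, the per-document cost is roughly proportional to the number of words in the document, which matters when the number of documents is large.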

Answer 2

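You can try the following code:

    String sentence = "This is a sample sentence for testing stop word deletion";

    String pattern = "\\b(a|the|for|is)\\b\\s*";
    sentence = sentence.replaceAll(pattern, "");

Result: This sample sentence testing stop word deletion

The pattern contains all the stop words separated by pipes, meaning it can match any one of them. The \b word boundaries mark the stop words as whole words; without them, the pattern would replace every occurrence of those character sequences, even inside other words. Surrounding each stop word with literal spaces instead (" a | the | for | is ") is not reliable: a match consumes the space that the next stop word needs, so in "is a" only "is" would be removed.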
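With many documents, it may help to build the regular expression once from the stop-word list and reuse the compiled `Pattern`. The following is a sketch under stated assumptions: the class name `RegexStopWordRemover` is hypothetical, it uses `\b` word boundaries rather than literal spaces, and it matches case-insensitively.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexStopWordRemover {
    private final Pattern pattern;

    RegexStopWordRemover(List<String> stopWords) {
        if (stopWords.isEmpty()) {
            throw new IllegalArgumentException("stopWords must not be empty");
        }
        // Quote each word so regex metacharacters in it are treated literally.
        String alternation = stopWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // \b anchors each match to a whole word; \s* also removes the
        // whitespace that follows the removed word.
        this.pattern = Pattern.compile(
                "\\b(?:" + alternation + ")\\b\\s*",
                Pattern.CASE_INSENSITIVE);
    }

    // Removes every stop word from text, reusing the compiled pattern.
    String remove(String text) {
        return pattern.matcher(text).replaceAll("").trim();
    }
}
```

Calling `String.replaceAll` recompiles its pattern on every call, so compiling once avoids repeating that work for each document. Whether this beats a Set lookup over split words depends on the size of the stop-word list and is worth measuring.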

Answer 3

There is no need to split; just replace the target string with an empty string:

String currentDoc = getCurrentDoc();
currentDoc = currentDoc.replace(stringToReplace,"");

Alternatively, if you want to replace multiple words, use replaceAll with a regular expression.
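One caveat worth illustrating: `String.replace` does literal substring replacement, so it also removes the target when it occurs inside a longer word. The sketch below contrasts it with a whole-word `replaceAll`; the class name `ReplaceDemo` and the stop word "is" are chosen only for illustration.

```java
class ReplaceDemo {
    // Literal replacement: removes "is" everywhere, including inside "This".
    static String literal(String doc) {
        return doc.replace("is", "");
    }

    // Regex replacement: \b restricts the match to the whole word "is",
    // and \s* removes the whitespace after it.
    static String wholeWord(String doc) {
        return doc.replaceAll("\\bis\\b\\s*", "");
    }

    public static void main(String[] args) {
        System.out.println(literal("This is it"));   // Th  it
        System.out.println(wholeWord("This is it")); // This it
    }
}
```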

What is an efficient way to remove stop words from text using Java?
