Remove stop words from a list of strings

Asked: 2015-01-30 14:52:39

Tags: java arrays arraylist stop-words

I have a list of strings, and I want to remove some stop words from it:

for (int i = 0; i < simple_title.getItemCount(); i++) {
    // split the phrase into the words
    String str = simple_title.getItem(i);
    String[] title_parts = str.split(" ");
    ArrayList<String> list = new ArrayList<>(Arrays.asList(title_parts));
    for (int k = 0; k < list.size(); k++) {
        for (int l = 0; l < StopWords.stopwordslist.length; l++) {
            // stopwordslist is a Static Variable in class StopWords
            list.remove(StopWords.stopwordslist[l]);
        }
    }

    title_parts = list.toArray(new String[0]);
    for (String title_part : title_parts) {
        // and here I want to print the string
        System.out.println(title_part);
    }
    Arrays.fill(title_parts, null);
}

The problem is that after removing the stop words, I only get the first element of title_parts. For example, if I have a list of strings such as:

 list of strings
 i am a list
 is remove stop there list...

then after removing the stop words I only get:

 list
 list
 remove

but what I should get is:

  list strings
  list
  remove stop list

I have been working on this for a while, but now I am confused. Can someone tell me what I am doing wrong?

1 answer:

Answer 0 (score: 1)

You are removing the StopWords item at an index defined by iterating over your list!

So the removal is arbitrary, to say the least, and the result will ultimately depend on the size of your stop-word list.
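As background, here is a minimal sketch of how the two `List.remove` overloads behave, since mixing up index-based and value-based removal is one common way to end up with seemingly arbitrary results. Whether that is exactly what happened in the code the asker ran is an assumption; the class and method names below are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RemoveOverloads {
    // List.remove(int): removes whatever element sits at the given position.
    static List<String> removeAtIndex(List<String> words, int index) {
        List<String> copy = new ArrayList<>(words);
        copy.remove(index);
        return copy;
    }

    // List.remove(Object): removes only the FIRST element equal to the given word.
    static List<String> removeFirstMatch(List<String> words, String word) {
        List<String> copy = new ArrayList<>(words);
        copy.remove(word);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "list", "of", "a", "list");
        System.out.println(removeAtIndex(words, 2));      // [a, list, a, list]
        System.out.println(removeFirstMatch(words, "a")); // [list, of, a, list]
    }
}
```

Note that `remove(Object)` leaves the second "a" in place, so a stop word that appears twice needs either repeated calls or `removeAll`.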

Here is a self-contained example of what you probably want to do:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // defining the list of words (in your case this comes from the split)
        List<String> listOfWords = new ArrayList<>(Arrays.asList(
                "list", "of", "strings", "i", "am", "a", "list", "is", "remove", "stop", "there", "list"));
        // defining an array of stop words (you probably want that as a constant somewhere else)
        final String[] stopWords = {"of", "i", "am", "a", "is"};
        // printing un-processed list
        System.out.printf("Dirty: %s%n", listOfWords);
        // removeAll removes every occurrence of every stop word in one call
        listOfWords.removeAll(Arrays.asList(stopWords));
        // printing "clean" list
        System.out.printf("Clean: %s%n", listOfWords);
    }
}

Output

Dirty: [list, of, strings, i, am, a, list, is, remove, stop, there, list]
Clean: [list, strings, list, remove, stop, there, list]
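To connect this back to the question, the same `removeAll` idea can be applied once per title instead of to one flat list. The following is only a sketch under stated assumptions: the titles are held in a plain `List<String>` rather than the asker's `simple_title` component, `TitleCleaner` and `clean` are hypothetical names, and the stop-word list is made up (it includes "there" so that the result matches the output the asker expected).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TitleCleaner {
    // Hypothetical stop-word list; in the question this lives in StopWords.stopwordslist.
    static final List<String> STOP_WORDS = Arrays.asList("of", "i", "am", "a", "is", "there");

    // Splits a title on whitespace, drops every stop word, and rejoins the remaining words.
    static String clean(String title) {
        List<String> words = new ArrayList<>(Arrays.asList(title.trim().split("\\s+")));
        words.removeAll(STOP_WORDS); // case-sensitive, exact-match removal
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "list of strings", "i am a list", "is remove stop there list");
        for (String title : titles) {
            System.out.println(clean(title));
        }
        // prints:
        // list strings
        // list
        // remove stop list
    }
}
```

Matching here is case-sensitive, so "Of" would not be removed; lowercasing the words before filtering is one option if that matters for your data.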