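Removing stop words from a String in Java

Time: 2014-12-29 08:48:12

Tags: java string stop-words

I have a string containing a large number of words, and a text file containing stop words that need to be removed from that string. Say I have the string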

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing the stop words, the string should be:

"love phone, super fast much cool jelly bean....but recently bugs."

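I have been able to achieve this, but the problem I am facing is that when the string contains adjacent stop words, only the first one is removed, and I get this result: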

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"  

Here is my stopwordslist.txt file: Stopwords

How can I fix this? This is what I have done so far:

int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
        FileReader fr=new FileReader("F:\\stopwordslist.txt");
        BufferedReader br= new BufferedReader(fr);
        while ((sCurrentLine = br.readLine()) != null){
            stopwords[k]=sCurrentLine;
            k++;
        }
        String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
        StringBuilder builder = new StringBuilder(s);
        String[] words = builder.toString().split("\\s");
        for (String word : words){
            wordsList.add(word);
        }
        for(int ii = 0; ii < wordsList.size(); ii++){
            for(int jj = 0; jj < k; jj++){
                if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                    wordsList.remove(ii);
                    break;
                }
             }
        }
        for (String str : wordsList){
            System.out.print(str+" ");
        }   
    }catch(Exception ex){
        System.out.println(ex);
    }

10 answers:

Answer 0 (score: 5)

Here is a more elegant solution (IMHO), using only regular expressions:

    // instead of the ".....", add all your stopwords, separated by "|"
    // "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
    // the "\\s?" is to suppress optional trailing white space
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
    String s = m.replaceAll("");
    System.out.println(s);
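
The pattern above has to be typed out by hand. A minimal sketch of building the same kind of pattern from a list of stop words follows. The class name `RegexStopWords`, the helper names, and the short stop word list are assumptions made for illustration; in practice the list would be read from the stop word file. `Pattern.quote` escapes each word so that any regex metacharacters it contains are matched literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RegexStopWords {

    // Builds one case-insensitive pattern of the form \b(?:w1|w2|...)\b\s?
    static Pattern buildPattern(List<String> stopWords) {
        String alternation = stopWords.stream()
                .map(Pattern::quote)               // match each word literally
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?",
                Pattern.CASE_INSENSITIVE);
    }

    // Removes every occurrence of any stop word, plus one trailing whitespace character.
    static String removeStopWords(String text, List<String> stopWords) {
        return buildPattern(stopWords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical stop words standing in for the contents of the file.
        List<String> stopWords = Arrays.asList("I", "this", "its", "and");
        System.out.println(
                removeStopWords("I love this phone, its super fast and cool", stopWords));
        // prints "love phone, super fast cool"
    }
}
```

Because `\b` treats an apostrophe as a word boundary, a stop word such as "I" would also be stripped from "I've", leaving "'ve" behind. Whether that is acceptable depends on the stop word list.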

Answer 1 (score: 4)

Try the following program.

String s="I love this phone, its super fast and there's so" +
            " much new and cool things with jelly bean....but of recently I've seen some bugs.";
    String[] words = s.split(" ");
    ArrayList<String> wordsList = new ArrayList<String>();
    Set<String> stopWordsSet = new HashSet<String>();
    stopWordsSet.add("I");
    stopWordsSet.add("THIS");
    stopWordsSet.add("AND");
    stopWordsSet.add("THERE'S");

    for(String word : words)
    {
        String wordCompare = word.toUpperCase();
        if(!stopWordsSet.contains(wordCompare))
        {
            wordsList.add(word);
        }
    }

    for (String str : wordsList){
        System.out.print(str+" ");
    }

Output: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.

Answer 2 (score: 3)

You can use the replaceAll function like this

String yourString ="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
yourString=yourString.replaceAll("stop" ,"");

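Answer 3 (score: 2)

The error occurs because you remove elements from the list you are iterating over. Say wordsList contains |word0|word1|word2|. If ii equals 1 and the if test is true, you call wordsList.remove(1);. After that, your list is |word0|word2|. Then ii is incremented to 2, which is now beyond the size of the list, so word2 is never tested.

From there, there are several solutions. For example, instead of removing the value you can set it to "". Or you can create a separate "result" list.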
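
The "result list" idea can be sketched as follows. This is only an illustration under assumptions: the class name `ResultListFilter`, the method name `filter`, and the small lower-case stop word set are hypothetical and stand in for the words read from the file. The input list is never modified while it is being traversed, so adjacent stop words cannot be skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ResultListFilter {

    // Copies every word that is not a stop word into a new list.
    // stopWords is assumed to hold lower-case entries.
    static List<String> filter(List<String> words, Set<String> stopWords) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (!stopWords.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical stop words standing in for the contents of the file.
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "this", "and", "so"));
        List<String> words = Arrays.asList("I love this phone and so on".split("\\s+"));
        System.out.println(String.join(" ", filter(words, stopWords)));
        // prints "love phone on"
    }
}
```

Using a `Set` with an exact `contains` lookup also avoids the substring matching of `String.contains`, which in the question's loop makes a word count as a stop word whenever it merely appears inside one.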

Answer 4 (score: 1)

Try the following approach:

   String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
   String stopWords[]={"love","this","cool"};
   for(int i=0;i<stopWords.length;i++){
       if(s.contains(stopWords[i])){
           s=s.replaceAll(stopWords[i]+"\\s+", ""); //note this will remove spaces at the end
       }
   }
   System.out.println(s);

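This way your final output will not contain the words you do not want. Just put the stop words in an array and replace them in the string. Output with my stop words: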

I   phone, its super fast and there's so much new and  things with jelly bean....but of recently I've seen some bugs.

Answer 5 (score: 1)

Instead, why not use the following approach? It is easier to read and understand:

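// words holds the stop words; replaceAll is needed because replace() treats "\\s*" literally
for(String word : words){
    s = s.replaceAll(word+"\\s*", "");
}
System.out.println(s);//It will print the string with the words removed.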

Answer 6 (score: 0)

Try using String's replaceAll API, like:

String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords = "I|its|with|but";
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
System.out.println(afterStopWords);

OUTPUT: 
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs.

Answer 7 (score: 0)

Try storing the stop words in a Set, then tokenize the string into a list. After that you can simply use 'removeAll' to get the result.

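Set<String> stopwords = new HashSet<>();
//fill in the set with your file

String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
// Arrays.asList returns a fixed-size list, so copy it into an ArrayList before removing
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));

listOfStrings.removeAll(stopwords);
String result = StringUtils.join(listOfStrings, " "); // StringUtils is from Apache Commons Lang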

No loops needed; they usually mean trouble.
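
One caveat: `removeAll` compares with `equals`, so "I" in the text does not match "i" in the set. A minimal sketch of a case-insensitive variant using `Collection.removeIf` is shown below. The class name `RemoveIfFilter`, the method name, and the sample stop words are assumptions for illustration, and the set is assumed to contain lower-case entries only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class RemoveIfFilter {

    static String removeStopWords(String text, Set<String> lowerCaseStopWords) {
        // Copy into an ArrayList because the list from Arrays.asList is fixed-size.
        List<String> tokens = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        // Drop every token whose lower-case form is in the stop word set.
        tokens.removeIf(t -> lowerCaseStopWords.contains(t.toLowerCase(Locale.ROOT)));
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Hypothetical stop words standing in for the contents of the file.
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "its", "and", "so"));
        System.out.println(removeStopWords("I love it and so do you", stopWords));
        // prints "love it do you"
    }
}
```

Tokens are split on whitespace only, so a token with attached punctuation such as "phone," is compared as is and would not match a stop word "phone".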

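Answer 8 (score: 0)

It seems that once one stop word has been removed you stop and move on to the next word: you need to remove all the stop words in the sentence.

You should try changing your code:

From:

    for(int ii = 0; ii < wordsList.size(); ii++){
        for(int jj = 0; jj < k; jj++){
            if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                wordsList.remove(ii);
                break;
            }
        }
    }

To something like:

    for(int ii = 0; ii < wordsList.size(); ii++) {
        for(int jj = 0; jj < k; jj++) {
            if(wordsList.get(ii).toLowerCase().contains(stopwords[jj])) {
                wordsList.remove(ii);
            }
        }
    }

Note that break has been removed, and stopwords[jj].contains(wordsList.get(ii).toLowerCase()) has been changed to wordsList.get(ii).toLowerCase().contains(stopwords[jj]).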

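Answer 9 (score: 0)

Recently one of my projects needed functionality to filter stop words, stemming words, and swear words from a given text or file. After going through a few blogs and write-ups, I created a simple library to filter data/files and made it available in Maven. Hope this helps someone.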

https://github.com/uttesh/exude
