How to remove hashtags and URLs from a String

Time: 2017-01-10 09:53:31

Tags: java string

I want to remove hashtags and URLs from a string.

Example:

  • Before: Cristiano Ronaldo is the #best player in the #world. https://..
  • After: Cristiano Ronaldo is the best player in the world.

How can I achieve this?

3 answers:

Answer 0 (score: 2)

First, replace all hashtag markers with an empty string.
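A minimal sketch of this step, assuming that "removing a hashtag" means dropping only the `#` character, as in the question's example. The class and method names are hypothetical and only serve the illustration.

```java
final class HashtagMarkers {
    // Removes every '#' character, so "#best" becomes "best".
    // String.replace treats its arguments literally (no regex involved).
    static String strip(String text) {
        return text.replace("#", "");
    }

    public static void main(String[] args) {
        String text = "Cristiano Ronaldo is the #best player in the #world.";
        String textWithoutHashtags = strip(text);
        System.out.println(textWithoutHashtags);
        // prints: Cristiano Ronaldo is the best player in the world.
    }
}
```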

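Call the result textWithoutHashtags; it is the initial text without the unwanted hashtag markers.

Next, replace all URLs with an empty string. I suggest using a regular expression for this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pattern = Pattern.compile("(http|ftp|https)://([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?");
Matcher matcher = pattern.matcher(textWithoutHashtags);
String textWithoutHashtagsAndUrls = matcher.replaceAll("");

Afterwards you should probably also trim the String to remove unnecessary whitespace:

String ready = textWithoutHashtagsAndUrls.trim();

Note that this regular expression only matches URLs prefixed with http://, https://, or ftp://. Removing www.google.de will not work.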
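If bare `www.` addresses should be removed as well, one possible variation is sketched below. It is not the pattern from the answer: it assumes that a URL starts with `http://`, `https://`, `ftp://`, or `www.` and runs until the next whitespace character. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class UrlStripper {
    // Assumption: a URL starts with one of the listed prefixes at a word
    // boundary and extends up to the next whitespace character.
    private static final Pattern URL =
            Pattern.compile("\\b(?:(?:https?|ftp)://|www\\.)\\S+");

    static String stripUrls(String text) {
        String withoutUrls = URL.matcher(text).replaceAll("");
        // Collapse the gaps left behind by removed URLs, then trim both ends.
        return withoutUrls.replaceAll("\\s{2,}", " ").trim();
    }

    public static void main(String[] args) {
        String text = "Cristiano Ronaldo is the best player in the world. www.google.de";
        System.out.println(stripUrls(text));
        // prints: Cristiano Ronaldo is the best player in the world.
    }
}
```

Because the match runs to the next whitespace, punctuation glued to the end of a URL (for example a closing period) is removed together with it. Whether that is acceptable depends on the input.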

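Answer 1 (score: 0)

The String class has a replaceAll method that replaces every substring matching a regular expression with a given (possibly empty) string. See the Javadoc for String.replaceAll.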

String tweet = "Cristiano Ronaldo is the #best player in the #world. http://www.google.com";
// Remove only the '#' character; the tagged words stay in the text.
String tweetWithoutHash = tweet.replaceAll("#", "");
System.out.println(tweetWithoutHash); // Cristiano Ronaldo is the best player in the world. http://www.google.com
// Remove anything that starts with one of the listed schemes.
String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
// trim() drops the space left in front of the removed URL.
String tweetWithoutHashAndUrl = tweetWithoutHash.replaceAll(urlPattern, "").trim();
System.out.println(tweetWithoutHashAndUrl); // Cristiano Ronaldo is the best player in the world.
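One detail of replaceAll is easy to miss: its first argument is always interpreted as a regular expression. That is harmless for `#`, which has no special meaning in a default Java pattern, but it matters for characters such as `.`. A small sketch of the difference, using a made-up input string:

```java
import java.util.regex.Pattern;

final class ReplaceVsReplaceAll {
    public static void main(String[] args) {
        String version = "1.2.3"; // illustrative input, not from the question

        // replaceAll: "." is a regex matching any character, so nothing is left.
        System.out.println(version.replaceAll(".", "").isEmpty()); // true

        // replace: "." is taken literally.
        System.out.println(version.replace(".", "")); // 123

        // Pattern.quote makes replaceAll treat its argument literally as well.
        System.out.println(version.replaceAll(Pattern.quote("."), "")); // 123
    }
}
```

For a fixed literal such as `#`, String.replace expresses the intent directly and avoids any regex surprises.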

Answer 2 (score: 0)

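You can use this function, removeStopwords(tweet), to remove stop words, hashtags, and mentions from a user's tweet. For the stop-word list, supply your own list or remove that step. Note that it drops hashtag and mention tokens entirely and returns a list of lowercased words rather than a string.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Placeholder stop-word list (not part of the original answer); replace it with your own.
static final Set<String> stopwords = new HashSet<>(Arrays.asList("a", "an", "the"));

public static ArrayList<String> removeStopwords(String tweet) {
    ArrayList<String> wordsList = new ArrayList<>();
    // Split on runs of whitespace and normalize each token to lower case.
    for (String word : tweet.split("\\s+")) {
        wordsList.add(word.toLowerCase());
    }
    wordsList.removeAll(stopwords);
    // Drop empty tokens plus any token containing '@' (mentions) or '#' (hashtags).
    // removeIf avoids the skipped elements and IndexOutOfBoundsException that
    // calling remove(ii) inside an index-based loop can cause.
    wordsList.removeIf(w -> w.isEmpty() || w.contains("@") || w.contains("#"));
    return wordsList;
}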