I want to remove hashtags and URLs from a string.
Example (input, then the desired output):
Cristiano Ronaldo is the #best player in the #world. https://..
Cristiano Ronaldo is the best player in the world.
How can I achieve this?
Answer 0 (score: 2)
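First, replace all hashtag characters with an empty string (here text stands for your input string):
String textWithoutHashtags = text.replaceAll("#", "");
textWithoutHashtags is now the initial text without the unwanted hashtags.
Next, replace all URLs with an empty string. I suggest using a regular expression for this:
Pattern pattern = Pattern.compile("(http|ftp|https)://([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?");
Matcher matcher = pattern.matcher(textWithoutHashtags);
String textWithoutHashtagsAndUrls = matcher.replaceAll("");
You should probably also trim the String afterwards to remove unnecessary whitespace:
String ready = textWithoutHashtagsAndUrls.trim();
Note that this regular expression only matches URLs prefixed with http://, https:// or ftp://. Removing a bare www.google.de will not work.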
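Putting the three steps together, a minimal sketch could look like the following. The class name TweetCleaner and the method name clean are made up for illustration; the pattern is the one from this answer, with the backslashes doubled so that it compiles as a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not part of any library.
final class TweetCleaner {
    // URL pattern from the answer above, escaped for a Java string literal.
    // It only matches URLs that start with http://, https:// or ftp://.
    private static final Pattern URL = Pattern.compile(
        "(http|ftp|https)://([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?");

    // Removes every '#' character, then every matching URL, then trims the result.
    static String clean(String text) {
        // String.replace treats "#" literally, so no regex is needed for this step.
        String withoutHashtags = text.replace("#", "");
        String withoutUrls = URL.matcher(withoutHashtags).replaceAll("");
        return withoutUrls.trim();
    }

    public static void main(String[] args) {
        String tweet = "Cristiano Ronaldo is the #best player in the #world. http://www.google.com";
        System.out.println(clean(tweet));
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which matters if you clean many tweets in a loop.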
Answer 1 (score: 0)
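The String class has a replaceAll method that replaces every match of a regular expression with a given (possibly empty) string. See the Javadoc for String.replaceAll for details.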
String tweet = "Cristiano Ronaldo is the #best player in the #world. http://www.google.com";
String tweetWithoutHash = tweet.replaceAll("#", "");
System.out.println(tweetWithoutHash); // Cristiano Ronaldo is the best player in the world. http://www.google.com
String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
String tweetWithoutHashAndUrl = tweetWithoutHash.replaceAll(urlPattern, "");
System.out.println(tweetWithoutHashAndUrl.trim()); // Cristiano Ronaldo is the best player in the world. (trim() drops the space left where the URL was)
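If you prefer a single pass, the two replaceAll calls can be folded into one alternation. This is only a sketch under a simplifying assumption that is not in the answers above: a URL is taken to be any run of non-whitespace characters that starts with http://, https://, ftp:// or www. The class name OnePassCleaner is hypothetical.

```java
// Hypothetical helper; a simplified swap-in for the longer URL patterns above.
final class OnePassCleaner {
    // Matches either a single '#' or a whole URL-like token.
    // Assumption: a URL runs until the next whitespace character.
    private static final String HASH_OR_URL =
        "#|(?:(?:https?|ftp)://|\\bwww\\.)\\S+";

    static String clean(String tweet) {
        return tweet.replaceAll(HASH_OR_URL, "") // drop '#' characters and URLs
                    .replaceAll("\\s+", " ")     // collapse the gaps left behind
                    .trim();                     // remove leading/trailing space
    }

    public static void main(String[] args) {
        System.out.println(clean("Cristiano Ronaldo is the #best player in the #world. http://www.google.com"));
    }
}
```

The trade-off is that \S+ also swallows punctuation glued to the end of a URL (for example a sentence-final period), so the longer patterns are the safer choice when that matters.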
Answer 2 (score: 0)
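You can use this function, removeStopwords(tweet), to remove stop words, hashtags and mentions from a user's tweet. For the stop words you have to supply your own list, or remove that step: `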
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Placeholder stop-word list (hypothetical); replace it with your own or drop the removeAll step.
private static final List<String> stopwords = Arrays.asList("is", "the", "in");

public static ArrayList<String> removeStopwords(String tweet) {
    ArrayList<String> wordsList = new ArrayList<String>();
    // Split on whitespace and normalize every token to lower case.
    for (String word : tweet.split("\\s+")) {
        wordsList.add(word.toLowerCase().trim());
    }
    wordsList.removeAll(stopwords);
    // Use an Iterator so that removing an element does not skip the one after it
    // (removing by index inside a forward for loop does).
    Iterator<String> it = wordsList.iterator();
    while (it.hasNext()) {
        String word = it.next();
        // Drop empty tokens and whole tokens that contain a mention (@) or a hashtag (#).
        if (word.isEmpty() || word.contains("@") || word.contains("#")) {
            it.remove();
        }
    }
    return wordsList;
}
`
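For comparison, the same token filtering can be sketched with streams. The class name TokenFilter, the method name filterTokens and the three stop words are assumptions for illustration only. Like the function above, this drops whole tokens that contain # or @, so the hashtag word itself is removed, which differs from the output the question asks for.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper class, shown only to illustrate the filtering steps.
final class TokenFilter {
    // Placeholder stop-word list; supply your own in practice.
    private static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("is", "the", "in"));

    // Returns the lower-cased tokens of the tweet without stop words,
    // hashtags (#...) and mentions (@...).
    static List<String> filterTokens(String tweet) {
        return Arrays.stream(tweet.trim().split("\\s+"))
                .map(String::toLowerCase)
                .filter(w -> !w.isEmpty())
                .filter(w -> !STOPWORDS.contains(w))
                .filter(w -> !w.contains("@") && !w.contains("#"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filterTokens(
            "Cristiano Ronaldo is the #best player in the #world. @someone"));
        // prints [cristiano, ronaldo, player]
    }
}
```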