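I want to remove all special characters, as well as some restricted words, from the input text.
Whatever I want to remove will be supplied dynamically.
(Let me clarify: whatever I need to exclude will be provided dynamically; the user decides what has to be excluded. That is why I did not hard-code a regex. restricted_words_list (see my code) will be fetched from the database; I kept it static only to check that the code works.)
For demonstration purposes I keep the entries in a String array to confirm that my code works.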
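import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestKeyword {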
private static final String[] restricted_words_list={"@","of","an","^","#","<",">","(",")"};
private static final Pattern restrictedReplacer;
private static Set<String> restrictedWords = null;
static {
StringBuilder strb= new StringBuilder();
for(String str:restricted_words_list){
strb.append("\\b").append(Pattern.quote(str)).append("\\b|");
}
strb.setLength(strb.length()-1);
restrictedReplacer = Pattern.compile(strb.toString(),Pattern.CASE_INSENSITIVE);
strb = new StringBuilder();
}
public static void main(String[] args)
{
String inputText = "abcd abc@ cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D";
System.out.println("inputText : " + inputText);
String modifiedText = restrictedWordCheck(inputText);
System.out.println("Modified Text : " + modifiedText);
}
public static String restrictedWordCheck(String input){
Matcher m = restrictedReplacer.matcher(input);
StringBuffer strb = new StringBuffer(input.length());//ensuring capacity
while(m.find()){
if(restrictedWords==null)restrictedWords = new HashSet<String>();
restrictedWords.add(m.group()); //m.group() returns what was matched
m.appendReplacement(strb,""); //this writes out what came in between matching words
for(int i=m.start();i<m.end();i++)
strb.append("");
}
m.appendTail(strb);
return strb.toString();
}
}
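The output is:
inputText : abcd abc@ cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D
Modified Text : abcd abc@ cbda ssef  jjj the gg  wh&at gggg ss%ss ### (()) DhD
Here the words of and an are excluded, but only some of the special characters are removed, not all of the ones I specified in restricted_words_list.
I now have a better solution: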
String inputText = title;// assigning input
List<String> restricted_words_list = catalogueService.getWordStopper(); // fetch all stop words from the database dynamically (getWordStopper() runs a query and returns the list of words)
String finalResult = "";
List<String> stopperCleanText = new ArrayList<String>();
String[] afterTextSplit = inputText.split("\\s"); // split and add to list
for (int i = 0; i < afterTextSplit.length; i++) {
stopperCleanText.add(afterTextSplit[i]); // adding to list
}
stopperCleanText.removeAll(restricted_words_list); // remove all word stopper
for (String addToString : stopperCleanText)
{
finalResult += addToString+";"; // add semicolon to cleaned text
}
return finalResult;
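A self-contained sketch of the same split-and-removeAll idea may make its behavior easier to check. The class name StopWordFilter and the stopWords parameter are hypothetical; stopWords stands in for the list that catalogueService.getWordStopper() returns in the snippet above. Unlike the snippet, this version joins the tokens with ";" and therefore leaves no trailing semicolon.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StopWordFilter {
    // `stopWords` is a stand-in for the list loaded from the database.
    static String removeStopWords(String input, List<String> stopWords) {
        // Split on runs of whitespace and copy into a mutable list.
        List<String> tokens = new ArrayList<>(Arrays.asList(input.split("\\s+")));
        // removeAll drops only tokens that are equal to a stop word.
        tokens.removeAll(stopWords);
        return String.join(";", tokens);
    }

    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("of", "an", "@", "#");
        // prints "abcd;abc@;jjj;###"
        System.out.println(removeStopWords("abcd abc@ of jjj an ###", stopWords));
    }
}
```

Because the comparison is per token, a symbol embedded in a token (abc@) or repeated (###) is left in place; only tokens that equal a list entry are removed.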
Answer 0 (score: 1)
public String replaceAll(String regex,
String replacement)
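Replaces each substring of this string that matches the given regular expression with the given replacement.
Parameters:
regex - the regular expression to which this string is to be matched
replacement - the string to be substituted for each match
So you just need to pass an empty string as the replacement argument.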
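As a minimal sketch of that advice: since replaceAll interprets its first argument as a regular expression, entries such as ^ or ( need to be quoted before they can be removed literally. The class ReplaceAllDemo and the helper removeLiteral are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

class ReplaceAllDemo {
    // Removes every literal occurrence of `token` from `input`.
    // Pattern.quote turns the token into a literal pattern, so regex
    // metacharacters such as ^ or ( are not interpreted.
    static String removeLiteral(String input, String token) {
        return input.replaceAll(Pattern.quote(token), "");
    }

    public static void main(String[] args) {
        // prints "gg (()) DhD"
        System.out.println(removeLiteral("g^g (()) D^h^D", "^"));
        // prints "g^g )) D^h^D"
        System.out.println(removeLiteral("g^g (()) D^h^D", "("));
    }
}
```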
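Answer 1 (score: 0)
Have you considered using a regex directly to replace those special characters with an empty string ""? See: Java; String replace (using regular expressions)?, and there is a tutorial here: http://www.vogella.com/articles/JavaRegularExpressions/article.html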
Answer 2 (score: 0)
You can also do it like this:
String inputText = "abcd abc@ cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D";
// keep only letters (both cases), digits, and spaces
String regx="[^a-zA-Z0-9 ]";
String textWithoutSpecialChar=inputText.replaceAll(regx,"");
System.out.println("Without Special Char:"+textWithoutSpecialChar);
String yourSetofString="\\b(?:of|an)\\b"; // your restricted words, matched as whole words only
String op=textWithoutSpecialChar.replaceAll(yourSetofString,"");
System.out.println("output : "+op);
Output:
Without Special Char:abcd abc cbda ssef of jjj the gg an what gggg ssss   DhD
output : abcd abc cbda ssef  jjj the gg  what gggg ssss   DhD
Answer 3 (score: 0)
String s = "abcd abc@ cbda ssef of jjj t#he g^g an wh&at ggg (blah) and | then";
String[] words = new String[]{ " of ", "|", "(", " an ", "#", "@", "&", "^", ")" };
StringBuilder sb = new StringBuilder();
for( String w : words ) {
if( w.length() == 1 ) {
sb.append( "\\" );
}
sb.append( w ).append( "|" );
}
System.out.println( s.replaceAll( sb.toString(), "" ) );
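A variant of the same idea, offered as a sketch rather than as this answer's method: instead of prefixing a backslash to one-character entries, each entry can be passed through Pattern.quote and the pieces joined with "|", which also avoids the trailing "|". The class QuotedAlternation and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class QuotedAlternation {
    // Builds one alternation in which every token is matched literally.
    static Pattern compile(List<String> tokens) {
        String alternation = tokens.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Deletes every occurrence of every token from the input.
    static String remove(String input, List<String> tokens) {
        return compile(tokens).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("|", "(", ")", "#", "@", "&", "^");
        System.out.println(remove("abc@ t#he g^g wh&at (blah) | then", tokens));
    }
}
```

Quoting works for entries of any length, so a multi-character entry containing metacharacters is handled the same way as a single symbol.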
Answer 4 (score: 0)
You should change your loop from this:
for(String str:restricted_words_list){
strb.append("\\b").append(Pattern.quote(str)).append("\\b|");
}
to this:
for(String str:restricted_words_list){
strb.append("\\b*").append(Pattern.quote(str)).append("\\b*|");
}
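The reason is that \\b only matches at a position between a word character and a non-word character, so your loop only matches a restricted_words_list element when such a boundary exists both before and after it. In abc@ the @ is followed by a space, which is also a non-word character, so there is no boundary after the @ and it is not replaced. If you add * (meaning zero or more occurrences) to the \\b, the boundary becomes optional and the pattern also matches things like abc@.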
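If every boundary is made optional, the pattern is effectively a plain alternation, so an entry such as of would also be removed from inside a longer word like often. One possible alternative, sketched here under the assumption that entries made only of word characters should match as whole words while everything else should match anywhere, is to add \\b only around word-like entries. The class RestrictedFilter and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RestrictedFilter {
    // Word-like entries get \b on both sides; symbols are matched literally anywhere.
    static Pattern build(List<String> restricted) {
        List<String> parts = new ArrayList<>();
        for (String entry : restricted) {
            String quoted = Pattern.quote(entry);
            boolean wordLike = entry.matches("\\w+");
            parts.add(wordLike ? "\\b" + quoted + "\\b" : quoted);
        }
        return Pattern.compile(String.join("|", parts), Pattern.CASE_INSENSITIVE);
    }

    // Deletes every match of the combined pattern from the input.
    static String clean(String input, List<String> restricted) {
        return build(restricted).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> restricted = Arrays.asList("@", "of", "an", "^", "#", "<", ">", "(", ")");
        String input = "abcd abc@ cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D";
        System.out.println(clean(input, restricted));
    }
}
```

With the question's list, & and % are left alone because they are not listed, and the spaces around removed words remain in the output.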