Question

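I am working on Twitter data normalization. Twitter users frequently use terms like "looooooove" to emphasize the word love. I want to normalize such words by removing repeated characters until I get a proper, meaningful word (I am aware that this mechanism cannot distinguish between words such as "good" and "god").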

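My strategy is:

1. Identify the existence of such repeated strings. I look for more than 2 identical consecutive characters, since there is probably no English word with more than two repeated characters in a row.

String[] strings = { "stoooooopppppppppppppppppp","looooooove", "good","OK", "boolean", "mee", "claaap" };

String regex = "([a-z])\\1{2,}";
Pattern pattern = Pattern.compile(regex);

for (String string : strings) {
    Matcher matcher = pattern.matcher(string);
    if (matcher.find()) {
        System.out.println(string + " TRUE ");
    }
}

2. Search for such words in a lexicon like WordNet.
3. Replace all but two of the repeated characters and check the lexicon.
4. If the word is not in the lexicon, remove one more repeated character (otherwise treat it as a misspelling).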
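
The strategy above could be sketched end to end roughly as follows. This is only a minimal sketch under stated assumptions: the class `ElongationNormalizer`, the method `normalize`, and the small in-memory `LEXICON` set are hypothetical stand-ins, and a real implementation would query a dictionary such as WordNet instead.

```java
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the strategy; not part of any library.
final class ElongationNormalizer {
    // A lowercase letter followed by 2+ copies of itself (a run of 3 or more).
    private static final Pattern ELONGATED = Pattern.compile("([a-z])\\1{2,}");
    // A lowercase letter followed by 1+ copies of itself (a run of 2 or more).
    private static final Pattern REPEATED = Pattern.compile("([a-z])\\1+");

    // Assumption: a toy lexicon standing in for a WordNet lookup.
    private static final Set<String> LEXICON =
            Set.of("stop", "love", "good", "clap", "boolean");

    static String normalize(String word) {
        // Step 1: words without a run of 3+ identical letters are left alone.
        if (!ELONGATED.matcher(word).find()) {
            return word;
        }
        // Step 2: keep the word if the lexicon already knows it.
        if (LEXICON.contains(word)) {
            return word;
        }
        // Step 3: shrink every run of 3+ letters to exactly two and look again.
        String two = ELONGATED.matcher(word).replaceAll("$1$1");
        if (LEXICON.contains(two)) {
            return two;
        }
        // Step 4: shrink the remaining doubled letters to one.
        // The result may still be a misspelling if the lexicon lacks it.
        return REPEATED.matcher(two).replaceAll("$1");
    }

    public static void main(String[] args) {
        String[] strings = { "stoooooopppppppppppppppppp", "looooooove",
                             "good", "OK", "boolean", "mee", "claaap" };
        for (String s : strings) {
            System.out.println(s + " -> " + normalize(s));
        }
    }
}
```

With this toy lexicon, "looooooove" becomes "love" and "claaap" becomes "clap", while "good", "OK", "boolean" and "mee" are returned unchanged because they contain no run of three or more identical letters.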

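Due to my poor knowledge of Java, I am unable to manage steps 3 and 4. The problem is that I cannot replace all but two of the repeated consecutive characters. The following snippet replaces all but one repeated character:

System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));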

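I need help with the following:

A. How to replace all but 2 consecutive repeated characters.
B. How to remove one more consecutive character from the output of A. [I think B can be managed with the following snippet.]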

System.out.println(data.replaceAll("([a-zA-Z])\\1{1,}", "$1"));
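
One thing to keep in mind with the `{1,}` quantifier is that it also collapses legitimate double letters, so it is probably only safe on words that have already failed the lexicon check. The sketch below contrasts the two quantifiers; the class and method names are hypothetical and only serve the illustration.

```java
// Hypothetical demo class contrasting the {1,} and {2,} quantifiers.
final class CollapseDemo {
    // Collapses every run of 2+ identical ASCII letters to a single letter.
    static String collapseToOne(String data) {
        return data.replaceAll("([a-zA-Z])\\1{1,}", "$1");
    }

    // Collapses only runs of 3+ identical ASCII letters to a single letter.
    static String collapseLongRunsToOne(String data) {
        return data.replaceAll("([a-zA-Z])\\1{2,}", "$1");
    }

    public static void main(String[] args) {
        String[] strings = { "good", "boolean", "mee", "looooooove", "claaap" };
        for (String s : strings) {
            System.out.println(s + " | {1,}: " + collapseToOne(s)
                    + " | {2,}: " + collapseLongRunsToOne(s));
        }
    }
}
```

Here `collapseToOne("good")` yields "god" and `collapseToOne("boolean")` yields "bolean", whereas `collapseLongRunsToOne` leaves both words untouched and still turns "looooooove" into "love".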

Edit: The solution provided by Wiktor Stribiżew works perfectly in Java. I would like to know what changes are needed to get the same result in Python, which uses re.sub.

Answer 1

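Your regex ([a-z])\\1{2,} matches an ASCII letter, captures it into Group 1, and then matches 2 or more further occurrences of that same value. So all you need is to replace each match with the backreference $1, which holds the captured value. If you use a single $1, aaaaa is replaced with one a; if you use $1$1, it is replaced with aa.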

String twoConsecutivesOnly = data.replaceAll(regex, "$1$1");
String noTwoConsecutives = data.replaceAll(regex, "$1");
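
To see both replacements side by side, here is a small self-contained sketch that runs them over the sample strings from the question. The wrapper class `BackreferenceDemo` is a hypothetical name; the regex and replacement strings are the ones shown above.

```java
// Hypothetical wrapper around the two replaceAll calls shown above.
final class BackreferenceDemo {
    static final String REGEX = "([a-z])\\1{2,}";

    // Runs of 3+ identical letters become exactly two letters.
    static String twoConsecutivesOnly(String data) {
        return data.replaceAll(REGEX, "$1$1");
    }

    // Runs of 3+ identical letters become one letter.
    // Existing double letters (as in "good") are not touched by this regex.
    static String noTwoConsecutives(String data) {
        return data.replaceAll(REGEX, "$1");
    }

    public static void main(String[] args) {
        String[] strings = { "stoooooopppppppppppppppppp", "looooooove",
                             "good", "OK", "boolean", "mee", "claaap" };
        for (String s : strings) {
            System.out.println(s + " -> " + twoConsecutivesOnly(s)
                    + " / " + noTwoConsecutives(s));
        }
    }
}
```

For example, "looooooove" gives "loove" and "love", and "claaap" gives "claap" and "clap"; strings without a run of three or more lowercase letters come back unchanged.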


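If you need the regex to be case insensitive, use "(?i)([a-z])\\1{2,}" or even "(\\p{Alpha})\\1{2,}". Note that \\p{Alpha} accepts ASCII letters of either case, but without (?i) the backreference still requires the repeats to have the same case as the captured letter. If any Unicode letters must be handled, use "(\\p{L})\\1{2,}".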
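
The differences between these variants can be checked with a short sketch. The class `RegexVariantsDemo` and the method `shrinkToTwo` are hypothetical names, and the sample inputs are made up for illustration.

```java
// Hypothetical demo comparing the regex variants mentioned above.
final class RegexVariantsDemo {
    // Replaces every match of the given regex with two copies of group 1.
    static String shrinkToTwo(String regex, String data) {
        return data.replaceAll(regex, "$1$1");
    }

    public static void main(String[] args) {
        String upper = "LOOOOVE";
        String mixed = "LoOoOve";
        // "h" followed by four e-acute letters, written with Unicode escapes.
        String accented = "h\u00e9\u00e9\u00e9\u00e9";

        // Lowercase-only class: uppercase input is not matched.
        System.out.println(shrinkToTwo("([a-z])\\1{2,}", upper));
        // (?i) makes both the class and the backreference case insensitive.
        System.out.println(shrinkToTwo("(?i)([a-z])\\1{2,}", upper));
        System.out.println(shrinkToTwo("(?i)([a-z])\\1{2,}", mixed));
        // \p{Alpha} matches either case, but repeats must share the captured case.
        System.out.println(shrinkToTwo("(\\p{Alpha})\\1{2,}", upper));
        System.out.println(shrinkToTwo("(\\p{Alpha})\\1{2,}", mixed));
        // \p{L} also covers non-ASCII letters.
        System.out.println(shrinkToTwo("(\\p{L})\\1{2,}", accented));
    }
}
```

With (?i), "LoOoOve" shrinks to "Loove" because the replacement reuses the first captured letter, while the \\p{Alpha} variant leaves that mixed-case input unchanged.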

Answer 2

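/* This code reports each character that is repeated 3 times consecutively in a given string.
   To check for 4 consecutive occurrences, change count == 2 to count == 3;
   to check for 2 consecutive occurrences, change count == 2 to count == 1. */
public class Test1 {
    // Most recently reported character; later occurrences of it are skipped.
    static char ch;

    public static void main(String[] args) {
        String str = "aabbbbccc";
        char[] charArray = str.toCharArray();
        int count = 0; // adjacent equal pairs seen in the current run
        for (int i = 1; i < charArray.length; i++) {
            // Skip the rest of an already reported run, e.g. the 4th 'd' in "ddddee".
            if (charArray[i] == ch) continue;
            if (charArray[i] == charArray[i - 1]) {
                count++;
                if (count == 2) {
                    // Third consecutive occurrence: report it and remember the character.
                    System.out.println(charArray[i]);
                    count = 0;
                    ch = charArray[i];
                }
            } else {
                // The run ended before reaching 3, e.g. "aa" followed by "bb" in "aabb".
                count = 0;
            }
        }
    }
}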
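
Instead of editing the `count == 2` constant by hand, the threshold could be passed as a parameter. The following is an alternative formulation, not the code above: it measures the length of each run directly, and the names `RunFinder` and `runsOfAtLeast` are hypothetical. Unlike the loop above, it reports a character again if it forms a second long run later in the string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds runs of identical characters of a minimum length.
final class RunFinder {
    // Returns, in order of appearance, the character of every run that
    // contains at least minRun identical consecutive characters.
    static List<Character> runsOfAtLeast(String str, int minRun) {
        List<Character> result = new ArrayList<>();
        int start = 0;
        while (start < str.length()) {
            // Advance end to the first position past the current run.
            int end = start;
            while (end < str.length() && str.charAt(end) == str.charAt(start)) {
                end++;
            }
            if (end - start >= minRun) {
                result.add(str.charAt(start));
            }
            start = end;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(runsOfAtLeast("aabbbbccc", 3));
        System.out.println(runsOfAtLeast("aabbbbccc", 2));
    }
}
```

For "aabbbbccc", a threshold of 3 yields [b, c] and a threshold of 2 yields [a, b, c].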

Replace consecutive repeated characters in Java