RegEx to split camelCase or TitleCase (advanced)

Time: 2011-09-29 07:36:39

Tags: java regex camelcasing title-case

I found a brilliant RegEx to extract the parts of a camelCase or TitleCase expression.

 (?<!^)(?=[A-Z])

It works as expected:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value

For example, in Java:

String s = "loremIpsum";
String[] words = s.split("(?<!^)(?=[A-Z])");
// words is now {"lorem", "Ipsum"}

My problem is that it does not work in some cases:

  • Case 1: VALUE -> V / A / L / U / E
  • Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext
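
Both failing cases can be reproduced with a few lines of Java. This is only a minimal sketch: the class name `NaiveSplitDemo` and the helper `split` are illustrative and not part of the question.

```java
import java.util.Arrays;

class NaiveSplitDemo {
    // The pattern from the question: split before every uppercase letter,
    // except one at the very start of the string.
    static String[] split(String s) {
        return s.split("(?<!^)(?=[A-Z])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("camelValue")));    // [camel, Value]
        System.out.println(Arrays.toString(split("VALUE")));         // [V, A, L, U, E]
        System.out.println(Arrays.toString(split("eclipseRCPExt"))); // [eclipse, R, C, P, Ext]
    }
}
```

The pattern looks only at the next character, so every uppercase letter inside a run starts a new token.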

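In my opinion, the result should be:

  • Case 1: VALUE
  • Case 2: eclipse / RCP / Ext

In other words, given n uppercase characters:

  • If the n characters are followed by lowercase characters, the groups should be: (n-1 characters) / (n-th character + lowercase characters)
  • If the n characters are at the end, the group should be: (n characters).

Any ideas on how to improve this regex?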

11 answers:

Answer 0 (score: 99)

The following regex works for all of the above examples:

public static void main(String[] args)
{
    for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
        System.out.println(w);
    }
}   

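It works by forcing the negative lookbehind to ignore not only matches at the start of the string, but also matches where an uppercase letter is preceded by another uppercase letter. This handles cases like "VALUE".

The first part of the regex on its own fails on "eclipseRCPExt" because it does not split between "RCP" and "Ext". That is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every uppercase letter that is followed by a lowercase letter, except at the start of the string.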
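
To see what each clause contributes, the two halves can be tried separately. The following is a sketch under illustrative names (`TwoClauseSplitDemo`, `FIRST`, `SECOND`, `splitFirstOnly`, `splitBoth`), none of which appear in the answer itself.

```java
import java.util.Arrays;

class TwoClauseSplitDemo {
    // First clause: split before an uppercase letter, but never at the start
    // of the string and never directly after another uppercase letter.
    static final String FIRST = "(?<!(^|[A-Z]))(?=[A-Z])";
    // Second clause: split before an uppercase letter that is followed by a
    // lowercase letter, except at the start of the string.
    static final String SECOND = "(?<!^)(?=[A-Z][a-z])";

    static String[] splitFirstOnly(String s) {
        return s.split(FIRST);
    }

    static String[] splitBoth(String s) {
        return s.split(FIRST + "|" + SECOND);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitFirstOnly("eclipseRCPExt"))); // [eclipse, RCPExt]
        System.out.println(Arrays.toString(splitBoth("eclipseRCPExt")));      // [eclipse, RCP, Ext]
        System.out.println(Arrays.toString(splitBoth("VALUE")));              // [VALUE]
    }
}
```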

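Answer 1 (score: 67)

It looks like you are making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter: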

(?<=[a-z])(?=[A-Z])

Here is how this regex splits your example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCPExt

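The only difference from your desired output is with eclipseRCPExt, which I would argue is correctly split here.

Addendum - improved version

Note: this answer recently got an upvote and I realized that there is a better way...

By adding a second alternative to the above regex, all of the OP's test cases are correctly split.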

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

Here is how the improved regex splits the example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCP / Ext

Edit: 20130824 Added improved version to handle the RCPExt -> RCP / Ext case.
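
In Java the improved pattern could be applied roughly as follows. This is a sketch only; the class name `LookaroundSplitDemo`, the constant `BOUNDARY` and the helper `split` are illustrative names.

```java
import java.util.regex.Pattern;

class LookaroundSplitDemo {
    // Alternative 1: between a lowercase letter and an uppercase letter.
    // Alternative 2: between an uppercase letter and an uppercase letter that
    // starts a capitalized word (uppercase followed by lowercase).
    static final Pattern BOUNDARY =
            Pattern.compile("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    static String[] split(String s) {
        return BOUNDARY.split(s);
    }

    public static void main(String[] args) {
        String[] samples = {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"};
        for (String s : samples) {
            // Prints e.g. "eclipseRCPExt -> eclipse / RCP / Ext"
            System.out.println(s + " -> " + String.join(" / ", split(s)));
        }
    }
}
```

Because both alternatives require a character on each side of the split point, no special handling for the start of the string is needed.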

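Answer 2 (score: 27)

Another solution is to use the dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase

Answer 3 (score: 10)

I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own, which I have tested and which seems to do exactly what you are looking for: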

((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))

Here is an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
;   (^[a-z]+)                       Match against any lower-case letters at the start of the string.
;   ([A-Z]{1}[a-z]+)                Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out of the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
newString := Trim(newString)

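Here I separate each word with a space, so here are some examples of how strings are transformed:

  • ThisIsATitleCASEString => This Is A Title CASE String
  • andThisOneIsCamelCASE => and This One Is Camel CASE

The solution above does what the original post asks for, but I also needed a regex to find camel and Pascal case strings that include numbers, so I also came up with this variation to include numbers: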

((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))

And an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
;   (^[a-z]+)                               Match against any lower-case letters at the start of the string.
;   ([0-9]+)                                Match against one or more consecutive numbers (anywhere in the string, including at the start).
;   ([A-Z]{1}[a-z]+)                        Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)|([0-9])))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out of the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
newString := Trim(newString)

Here are some examples of how strings with numbers are transformed using this regex:

  • myVariable123 => my Variable 123
  • my2Variables => my 2 Variables
  • The3rdVariableIsHere => The 3 rdVariable Is Here
  • 12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too
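
Since the question is tagged Java, the same replace-and-trim approach can be sketched with `String.replaceAll`. This is an assumed port, not code from the answer; `WordSpacerDemo`, `WORDS` and `spaceWords` are illustrative names.

```java
class WordSpacerDemo {
    // Same pattern as the number-aware variation above; it contains no
    // backslashes, so it needs no extra escaping in a Java string literal.
    static final String WORDS =
            "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))";

    // Append a space after every matched word, then drop the trailing space.
    static String spaceWords(String s) {
        return s.replaceAll(WORDS, "$1 ").trim();
    }

    public static void main(String[] args) {
        System.out.println(spaceWords("ThisIsATitleCASEString")); // This Is A Title CASE String
        System.out.println(spaceWords("myVariable123"));          // my Variable 123
        System.out.println(spaceWords("The3rdVariableIsHere"));   // The 3 rdVariable Is Here
    }
}
```

The last line shows a limitation of the pattern: a lowercase run that is neither at the start of the string nor preceded by an uppercase letter (here "rd") is not matched, so no space is added after it.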

Answer 4 (score: 2)

To handle more letters than just A-Z:

s.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})");

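Either:

  • Split after any lowercase letter that is followed by an uppercase letter.

E.g. parseXML -> parse, XML

or

  • Split after any letter that is followed by an uppercase letter and then a lowercase letter.

E.g. XMLParser -> XML, Parser


In a more readable form: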

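import static java.util.stream.Collectors.joining;
// JUnit 4 is assumed here; the answer does not say which JUnit version it uses
import static org.junit.Assert.assertEquals;

import java.util.regex.Pattern;

import org.junit.Test;

public class SplitCamelCaseTest {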

    static String BETWEEN_LOWER_AND_UPPER = "(?<=\\p{Ll})(?=\\p{Lu})";
    static String BEFORE_UPPER_AND_LOWER = "(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";

    static Pattern SPLIT_CAMEL_CASE = Pattern.compile(
        BETWEEN_LOWER_AND_UPPER +"|"+ BEFORE_UPPER_AND_LOWER
    );

    public static String splitCamelCase(String s) {
        return SPLIT_CAMEL_CASE.splitAsStream(s)
                        .collect(joining(" "));
    }

    @Test
    public void testSplitCamelCase() {
        assertEquals("Camel Case", splitCamelCase("CamelCase"));
        assertEquals("lorem Ipsum", splitCamelCase("loremIpsum"));
        assertEquals("XML Parser", splitCamelCase("XMLParser"));
        assertEquals("eclipse RCP Ext", splitCamelCase("eclipseRCPExt"));
        assertEquals("VALUE", splitCamelCase("VALUE"));
    }    
}

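Answer 5 (score: 2)

The top answers here all provide code that uses positive lookbehinds, which not all regex flavors support. The regex below captures both PascalCase and camelCase and can be used in multiple languages.

Note: I do realize that this question is about Java; however, I have also seen this post referenced many times in other questions tagged with different languages, as well as in some comments on this question.

Code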

([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)

Results

Sample input

eclipseRCPExt

SomethingIsWrittenHere

TEXTIsWrittenHERE

VALUE

loremIpsum

Sample output

eclipse
RCP
Ext

Something
Is
Written
Here

TEXT
Is
Written
HERE

VALUE

lorem
Ipsum

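Explanation

  • Match one or more uppercase letters [A-Z]+
  • Or match zero or one uppercase letter [A-Z]?, followed by one or more lowercase letters [a-z]+
  • Ensure that what follows is an uppercase letter [A-Z] or a word boundary \b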
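
Because this pattern matches the words rather than the gaps between them, it is used with a find loop instead of `split`. A minimal Java sketch, with `WordMatcherDemo`, `WORD` and `words` as illustrative names that are not part of the answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatcherDemo {
    // Note the doubled backslash: "\\b" in Java source is the regex \b.
    static final Pattern WORD = Pattern.compile("([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\\b)");

    // Collect every match of the capture group, in order of appearance.
    static List<String> words(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("eclipseRCPExt"));     // [eclipse, RCP, Ext]
        System.out.println(words("TEXTIsWrittenHERE")); // [TEXT, Is, Written, HERE]
        System.out.println(words("VALUE"));             // [VALUE]
    }
}
```

For "RCPExt" the first alternative initially consumes "RCPE", fails the lookahead because "x" follows, and backtracks to "RCP", which is what keeps the acronym separate from "Ext".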

Answer 6 (score: 1)

You can use StringUtils.splitByCharacterTypeCamelCase("loremIpsum") from Apache Commons Lang.

Answer 7 (score: 0)

You can use the following expression for Java:

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)
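
A short sketch of how this expression behaves with `String.split`; the class name `DigitAwareSplitDemo` and the helper `split` are illustrative. The last two alternatives split on every transition between a digit and a non-digit, so digits always end up in their own token.

```java
import java.util.Arrays;

class DigitAwareSplitDemo {
    static final String BOUNDARY =
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])"
            + "|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)";

    static String[] split(String s) {
        return s.split(BOUNDARY);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("eclipseRCPExt"))); // [eclipse, RCP, Ext]
        System.out.println(Arrays.toString(split("my2Variables")));  // [my, 2, Variables]
        System.out.println(Arrays.toString(split("B2B")));           // [B, 2, B]
    }
}
```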

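Answer 8 (score: 0)

Instead of looking for separators that aren't there, you might also consider finding the name components (those are certainly there):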

String test = "_eclipse福福RCPExt";

Pattern componentPattern = Pattern.compile("_? (\\p{Upper}?\\p{Lower}+ | (?:\\p{Upper}(?!\\p{Lower}))+ \\p{Digit}*)", Pattern.COMMENTS);

Matcher componentMatcher = componentPattern.matcher(test);
List<String> components = new LinkedList<>();
int endOfLastMatch = 0;
while (componentMatcher.find()) {
    // matches should be consecutive
    if (componentMatcher.start() != endOfLastMatch) {
        // do something horrible if you don't want garbage in between

        // we're lenient though, any Chinese characters are lucky and get through as group
        String startOrInBetween = test.substring(endOfLastMatch, componentMatcher.start());
        components.add(startOrInBetween);
    }
    components.add(componentMatcher.group(1));
    endOfLastMatch = componentMatcher.end();
}

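if (endOfLastMatch != test.length()) {
    // find() has already returned false here, so componentMatcher.start() would
    // throw IllegalStateException; take everything after the last match instead
    String end = test.substring(endOfLastMatch);
    components.add(end);
}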

System.out.println(components);

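This outputs [eclipse, 福福, RCP, Ext]. Converting to an array is of course simple.

Answer 9 (score: 0)

I can confirm that the regex string ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b) given by ctwheels above works with the Microsoft flavor of regex.

I would also like to suggest the following alternative, based on ctwheels' regex, which handles numeric characters: ([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\b)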

This is able to split strings such as:

DrivingB2BTradeIn2019Onwards

into

Driving B2B Trade In 2019 Onwards
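
The answer refers to the Microsoft regex flavor; the same pattern also works with `java.util.regex`. A minimal Java sketch, with `DigitWordMatcherDemo`, `WORD` and `spaced` as illustrative names:

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitWordMatcherDemo {
    // Digits are treated like uppercase letters, so "B2B" and "2019" stay whole.
    static final Pattern WORD =
            Pattern.compile("([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\\b)");

    // Join all matched words with single spaces.
    static String spaced(String s) {
        StringJoiner joined = new StringJoiner(" ");
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            joined.add(m.group(1));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Prints "Driving B2B Trade In 2019 Onwards"
        System.out.println(spaced("DrivingB2BTradeIn2019Onwards"));
    }
}
```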

Answer 10 (score: 0)

JavaScript solution

/**
 * howToDoThis ===> ["", "how", "To", "Do", "This"]
 * @param word word to be split
 */
export const splitCamelCaseWords = (word: string) => {
    if (typeof word !== 'string') return [];
    return word.replace(/([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)/g, '!$&').split('!');
};