Splitting strings in Java

Date: 2015-03-22 08:33:54

Tags: java regex split

I'm trying to clean up a very noisy (due to OCR) data set of names and email addresses. One problem is multiple names in a single entry, for example:

"Fenner, Robert: Fishbume, Howard" should be "Fenner, Robert" and "Fishbume, Howard"

or "Fendrich, Karen N., Ricci, Vincent" should be "Fendrich, Karen N." and "Ricci, Vincent" 

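How can I use a regular expression to find entries in which comma-separated names are themselves separated by a comma or a colon, and then split the string?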

Other variations of the problem:

"'Emily Phaup ' Ryan, Thomas M" -> "Emily Phaup", "Ryan, Thomas M"

"A Lilly, Alisia Rudd, Andrew McComb, Daniel Lisbon, David Compton"
->"A Lilly", "Alisia Rudd", "Andrew McComb", "Daniel Lisbon", "David Compton"

"Abigail.Perlmangus.pm.com  Jay.Poole@us.pm.com" -> "Abigail.Perlmangus.pm.com", "Jay.Poole@us.pm.com"

And several more.

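I know it is probably not possible to separate all of these cases (especially without accidentally splitting correct names), but separating some of them would certainly help.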

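Edit: I think my question was a bit too broad, so I'll narrow it down:
Is there a way to find strings of the form "string1,string2, string3,string4" (where the strings can contain any kind of characters and whitespace) and split them into two separate strings: "string1,string2" and "string3,string4"?
Could someone also give me some pointers on how to do this, since I'm inexperienced with regular expressions?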

2 answers:

Answer 0 (score: 1)

I would try something like this:

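import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitPairs {

    public static void main(String[] args) {
        // Group 1 captures "word, word"; the trailing group consumes the
        // separator (or end of input) that closes the pair.
        String regex = "(\\w+(,|:|$)\\s*\\w+)(,|:|$)";
        Pattern pattern = Pattern.compile(regex);

        String[] tests = {
            "Fenner, Robert: Fishbume, Howard",
            "string1, string2, string3, string4"
        };

        for (String test : tests) {
            Matcher matcher = pattern.matcher(test);
            // Each successful find() yields one pair in group 1.
            while (matcher.find()) {
                System.out.println(matcher.group(1));
            }
        }
    }
}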

Output:

Fenner, Robert
Fishbume, Howard
string1, string2
string3, string4

This does not work for all cases, but it answers your last edit.


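What I did is search for one or more word characters (\w+) followed by , or : or the end of the string, then any whitespace and more word characters, and finally , or : or the end of the line.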
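Because \w+ matches neither spaces nor periods, an entry such as "Fendrich, Karen N., Ricci, Vincent" from the question only partly matches the pattern above. A possible variation, offered as a rough sketch rather than a definitive fix, is to allow any run of non-separator characters on each side of the inner separator. The class name `PairSplitter` and the method `splitPairs` are hypothetical names introduced for this illustration. The sketch also keeps any leftover text after the last pair, and it assumes there are no empty fields (two separators in a row), which it does not handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the answer above.
class PairSplitter {

    // Two runs of non-separator characters joined by one ',' or ':',
    // followed by another separator or the end of the input.
    private static final Pattern PAIR =
            Pattern.compile("\\s*([^,:]+[,:][^,:]+)(?:[,:]|$)");

    static List<String> splitPairs(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = PAIR.matcher(input);
        int lastEnd = 0;
        while (matcher.find()) {
            // Group 1 is the pair without the closing separator.
            result.add(matcher.group(1).trim());
            lastEnd = matcher.end();
        }
        // Keep whatever is left over, e.g. with an odd number of fields.
        String rest = input.substring(lastEnd).trim();
        if (!rest.isEmpty()) {
            result.add(rest);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String pair : splitPairs("Fendrich, Karen N., Ricci, Vincent")) {
            System.out.println(pair);
        }
    }
}
```

For the sample in `main` this yields the two entries "Fendrich, Karen N." and "Ricci, Vincent". The trade-off is that every second separator is treated as a boundary, so a list of plain names such as "A Lilly, Alisia Rudd, Andrew McComb" would be paired up incorrectly.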


Regex details

(\w+(,|:|$)\s*\w+)(,|:|$)
1st Capturing group (\w+(,|:|$)\s*\w+)
    \w+ match any word character [a-zA-Z0-9_]
        Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
    2nd Capturing group (,|:|$)
        1st Alternative: ,
            , matches the character , literally
        2nd Alternative: :
            : matches the character : literally
        3rd Alternative: $
            $ assert position at end of the string
    \s* match any white space character [\r\n\t\f ]
        Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
    \w+ match any word character [a-zA-Z0-9_]
        Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
3rd Capturing group (,|:|$)
    1st Alternative: ,
        , matches the character , literally
    2nd Alternative: :
        : matches the character : literally
    3rd Alternative: $
        $ assert position at end of the string
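The breakdown above can also be kept next to the pattern itself by compiling with the `Pattern.COMMENTS` flag, which makes the engine ignore whitespace and `#` comments inside the regex. The sketch below is one way to write it, not the answer's original code: it replaces `(,|:|$)` with the character class `[,:]` and a non-capturing group. The inner `$` is dropped on the assumption that it can never contribute to a match, since more word characters must follow it. The names `CommentedPairRegex` and `findPairs` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the same idea with an annotated pattern.
class CommentedPairRegex {

    // With Pattern.COMMENTS, whitespace and '#' comments in the pattern
    // are ignored, so each part can be documented in place.
    static final Pattern PAIR = Pattern.compile(
            "(            # group 1: the pair to keep\n"
          + "  \\w+       #   first word\n"
          + "  [,:]       #   separator inside the pair\n"
          + "  \\s*       #   optional whitespace\n"
          + "  \\w+       #   second word\n"
          + ")\n"
          + "(?:[,:]|$)   # separator between pairs, or end of input\n",
            Pattern.COMMENTS);

    static List<String> findPairs(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher matcher = PAIR.matcher(input);
        while (matcher.find()) {
            pairs.add(matcher.group(1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String pair : findPairs("Fenner, Robert: Fishbume, Howard")) {
            System.out.println(pair);
        }
    }
}
```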

Answer 1 (score: 0)

My honest advice is to take a representative sample to an online regex tester and keep experimenting until you can live with the output.

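As you pointed out, the input is not regular enough to really take advantage of regular expressions, but you can at least hack away at it. For data this dirty, there probably won't be a truly perfect answer.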
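The same kind of experimentation can be done locally. The following is a minimal sketch of a throwaway harness that runs a candidate pattern over a few samples from the question and collects what it extracts. The class `RegexPlayground` and the method `tryPattern` are hypothetical names, and the candidate regex is only a starting point to be swapped out while experimenting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scratch harness for trying patterns against sample data.
class RegexPlayground {

    // Maps each sample to the pieces the regex extracts from it.
    // Uses group 1 when the pattern has one, else the whole match.
    static Map<String, List<String>> tryPattern(String regex, List<String> samples) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, List<String>> report = new LinkedHashMap<>();
        for (String sample : samples) {
            List<String> found = new ArrayList<>();
            Matcher matcher = pattern.matcher(sample);
            while (matcher.find()) {
                found.add(matcher.groupCount() >= 1 ? matcher.group(1) : matcher.group());
            }
            report.put(sample, found);
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
                "Fenner, Robert: Fishbume, Howard",
                "Fendrich, Karen N., Ricci, Vincent",
                "A Lilly, Alisia Rudd, Andrew McComb, Daniel Lisbon, David Compton");
        String candidate = "(\\w+[,:]\\s*\\w+)(?:[,:]|$)";
        tryPattern(candidate, samples).forEach(
                (sample, found) -> System.out.println(sample + " -> " + found));
    }
}
```

Running a pattern against every known variation at once makes it easier to see which entries a change fixes and which it breaks, for example that this candidate finds only "Ricci, Vincent" in the second sample.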