Question

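I'm trying to clean up this very noisy (due to OCR) dataset of names and email addresses. One problem is multiple names in a single entry, for example: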

"Fenner, Robert: Fishbume, Howard" should be "Fenner, Robert" and "Fishbume, Howard"

or "Fendrich, Karen N., Ricci, Vincent" should be "Fendrich, Karen N." and "Ricci, Vincent"

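How can I use a regular expression to find entries in which comma-separated names are themselves joined by a comma or a colon, and then split the string?

Other variations of the problem: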

"'Emily Phaup ' Ryan, Thomas M" -> "Emily Phaup", "Ryan, Thomas M"

"A Lilly, Alisia Rudd, Andrew McComb, Daniel Lisbon, David Compton"
->"A Lilly", "Alisia Rudd", "Andrew McComb", "Daniel Lisbon", "David Compton"

"Abigail.Perlmangus.pm.com  Jay.Poole@us.pm.com" -> "Abigail.Perlmangus.pm.com", "Jay.Poole@us.pm.com"

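And a few more.

I know it probably won't be possible to separate all of these cases (especially without accidentally splitting correct names), but separating at least some of them would definitely help.

Edit: I think my question was a bit too broad, so I'll narrow it down:
Is there a way to find strings of the form "string1,string2, string3,string4" (the strings may contain any kind of character and whitespace) and split them into two separate strings: "string1,string2" and "string3,string4"?
And could someone give me some pointers on how to do it, since I'm inexperienced with regular expressions?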

Answer 1

I would try something like this:

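import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitNames {
    public static void main(String[] args) {
        // Two runs of word characters joined by ',' or ':',
        // followed by ',', ':' or the end of the input.
        String regex = "(\\w+(,|:|$)\\s*\\w+)(,|:|$)";
        Pattern pattern = Pattern.compile(regex);

        String[] tests = {
                "Fenner, Robert: Fishbume, Howard",
                "string1, string2, string3, string4"
        };

        for (String test : tests) {
            Matcher matcher = pattern.matcher(test);
            while (matcher.find()) {
                // Group 1 is the pair without its trailing delimiter.
                System.out.println(matcher.group(1));
            }
        }
    }
}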

Output:

Fenner, Robert
Fishbume, Howard
string1, string2
string3, string4

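This won't work for every case, but it does answer your last edit.

What I did was search for one or more word characters (\w+) followed by , or : or the end of the string, then any whitespace and more word characters, and finally , or : or the end of the line.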
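Because \w+ only matches word characters, the pattern above does not match a name that contains a space or a period, such as "Karen N." in the question. A possible variation, offered only as a sketch, is to match runs of anything that is not a delimiter. The class name PairSplitter, the method splitPairs, and the pattern itself are assumptions made for illustration; a leftover piece without a partner (an odd number of parts) is silently dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PairSplitter {
    // Hypothetical pattern: two runs of non-delimiter characters joined by
    // one ',' or ':', followed by another delimiter or the end of the input.
    private static final Pattern PAIR =
            Pattern.compile("([^,:]+[,:][^,:]+)(?:[,:]|$)");

    // Returns each "last, first" pair found in the input, trimmed.
    static List<String> splitPairs(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitPairs("Fendrich, Karen N., Ricci, Vincent"));
        System.out.println(splitPairs("string1,string2, string3,string4"));
    }
}
```

This only fits entries that really are "last, first" pairs. An entry such as "A Lilly, Alisia Rudd, Andrew McComb" would be paired up incorrectly, so it needs a different rule.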

Regex details

(\w+(,|:|$)\s*\w+)(,|:|$)
1st Capturing group (\w+(,|:|$)\s*\w+)
    \w+ match any word character [a-zA-Z0-9_]
        Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
2nd Capturing group (,|:|$)
    1st Alternative: ,
        , matches the character , literally
    2nd Alternative: :
        : matches the character : literally
    3rd Alternative: $
        $ assert position at end of the string
\s* match any white space character [\r\n\t\f ]
    Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\w+ match any word character [a-zA-Z0-9_]
    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
3rd Capturing group (,|:|$)
    1st Alternative: ,
        , matches the character , literally
    2nd Alternative: :
        : matches the character : literally
    3rd Alternative: $
        $ assert position at end of the string

Answer 2

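My honest advice is to take a representative sample to an online regex tester and keep experimenting until you can live with the output.

As you pointed out, the input isn't regular enough to really take advantage of regular expressions. But you can at least hack away at it. For data this dirty, there probably won't be a truly perfect answer.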
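The same kind of experimenting can also be done locally. The following is a minimal sketch, assuming you simply want to see what a candidate delimiter regex does to a few sample entries; the class SplitPlayground and the method trySplit are hypothetical names, and deciding which delimiter applies to which entry is still left to you.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SplitPlayground {
    // Splits the sample on the candidate delimiter regex,
    // trims each piece, and drops empty pieces.
    static List<String> trySplit(String delimiterRegex, String sample) {
        return Arrays.stream(Pattern.compile(delimiterRegex).split(sample))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Entries taken from the question, each tried with one candidate delimiter.
        System.out.println(trySplit(",",
                "A Lilly, Alisia Rudd, Andrew McComb, Daniel Lisbon, David Compton"));
        System.out.println(trySplit("\\s+",
                "Abigail.Perlmangus.pm.com  Jay.Poole@us.pm.com"));
    }
}
```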
