I am trying to clean up a very noisy (due to OCR) dataset of names and email addresses. One problem is multiple names in a single entry, for example:
"Fenner, Robert: Fishbume, Howard" should be "Fenner, Robert" and "Fishbume, Howard"
or "Fendrich, Karen N., Ricci, Vincent" should be "Fendrich, Karen N." and "Ricci, Vincent"
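How can I use a regex to find strings whose entries are separated by a comma or a colon, where the entries themselves contain a comma, and then split the string?
Other variants of the problem: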
"'Emily Phaup ' Ryan, Thomas M" -> "Emily Phaup", "Ryan, Thomas M"
"A Lilly, Alisia Rudd, Andrew McComb, Daniel Lisbon, David Compton"
->"A Lilly", "Alisia Rudd", "Andrew McComb", "Daniel Lisbon", "David Compton"
"Abigail.Perlmangus.pm.com Jay.Poole@us.pm.com" -> "Abigail.Perlmangus.pm.com", "Jay.Poole@us.pm.com"
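And a few more.
I know it probably isn't possible to separate all of these cases (especially without accidentally splitting correct names), but separating some of them would certainly help.
Edit: I think my question was a bit too broad, so I'll narrow it down:
Is there a way to find strings of the form "string1,string2, string3,string4" (where each string can contain any kind of character, including spaces) and split them into two separate strings: "string1,string2" and "string3,string4"?
Also, could someone give me some pointers on how to do this, since I am inexperienced with regex?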
Answer 0 (score: 1)
I would try something like this:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SplitNames {
        public static void main(String[] args) {
            // One "word, word" entry (group 1), followed by ',' or ':' or the end of the input.
            String regex = "(\\w+(,|:|$)\\s*\\w+)(,|:|$)";
            Pattern pattern = Pattern.compile(regex);
            String[] tests = {
                "Fenner, Robert: Fishbume, Howard",
                "string1, string2, string3, string4"
            };
            for (String test : tests) {
                Matcher matcher = pattern.matcher(test);
                // Each successful find() yields one entry in group 1.
                while (matcher.find()) {
                    System.out.println(matcher.group(1));
                }
            }
        }
    }
Output:
Fenner, Robert
Fishbume, Howard
string1, string2
string3, string4
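This won't work for every case, but it answers your latest edit.
What I did is search for any word characters (\w+) followed by , or : or the end of the string, followed by any whitespace and more word characters, and then by , or : or the end of the line.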
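Because \w+ does not match spaces or periods, the pattern above skips entries such as "Fendrich, Karen N.". One possible variation, offered only as a sketch: replace \w+ with [^,:]+ so that each part may contain anything except the two delimiters. The class name PairSplitter and the method splitPairs are made up for this illustration, and the sketch assumes every entry really is a "part, part" pair, so it would wrongly pair up a plain list such as "A Lilly, Alisia Rudd, Andrew McComb".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PairSplitter {
    // Assumption: each entry is "part<delimiter>part" and entries are
    // separated by ',' or ':'. [^,:]+ accepts any run of characters that
    // contains neither delimiter, so initials like "N." stay attached.
    private static final Pattern PAIR =
            Pattern.compile("([^,:]+[,:][^,:]+)(?:[,:]|$)");

    // Returns every "part, part" entry found in the input, trimmed.
    static List<String> splitPairs(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher matcher = PAIR.matcher(input);
        while (matcher.find()) {
            pairs.add(matcher.group(1).trim());
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Print one entry per line for two inputs from the question.
        splitPairs("Fendrich, Karen N., Ricci, Vincent").forEach(System.out::println);
        splitPairs("string1,string2, string3,string4").forEach(System.out::println);
    }
}
```

The trailing (?:[,:]|$) consumes the separator between entries, so the next find() starts cleanly at the following entry. An input with an odd number of parts leaves the last part unmatched, and it is silently dropped.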
Regex details

(\w+(,|:|$)\s*\w+)(,|:|$)

1st Capturing group (\w+(,|:|$)\s*\w+)
    \w+ match any word character [a-zA-Z0-9_]
        Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
    2nd Capturing group (,|:|$)
        1st Alternative: ,
            , matches the character , literally
        2nd Alternative: :
            : matches the character : literally
        3rd Alternative: $
            $ assert position at end of the string
    \s* match any white space character [\r\n\t\f ]
        Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
    \w+ match any word character [a-zA-Z0-9_]
        Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
3rd Capturing group (,|:|$)
    1st Alternative: ,
        , matches the character , literally
    2nd Alternative: :
        : matches the character : literally
    3rd Alternative: $
        $ assert position at end of the string
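The breakdown above can also live next to the pattern itself. A minimal sketch, assuming Java's Pattern.COMMENTS flag is acceptable in your code base: in that mode whitespace inside the pattern is ignored and # starts a comment that runs to the end of the line. The class name CommentedRegex and the method entries are hypothetical; the regex is the same one as in the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedRegex {
    // Same regex as above, spread over several lines. With Pattern.COMMENTS,
    // whitespace is ignored and '#' begins a comment up to the line break.
    static final Pattern PAIR = Pattern.compile(
            "(            # group 1: one 'word, word' entry\n"
          + "  \\w+       #   first word\n"
          + "  (,|:|$)    #   group 2: separator inside the entry\n"
          + "  \\s*       #   optional whitespace\n"
          + "  \\w+       #   second word\n"
          + ")            # end of group 1\n"
          + "(,|:|$)      # group 3: separator after the entry, or end of input\n",
            Pattern.COMMENTS);

    // Collects group 1 of every match, i.e. every entry the pattern finds.
    static List<String> entries(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = PAIR.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        entries("Fenner, Robert: Fishbume, Howard").forEach(System.out::println);
    }
}
```

Note that the $ alternative in group 2 can never take part in a match, because \w+ still has to match something after it.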
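Answer 1 (score: 0)
My honest advice is to take a representative sample to an online regex tester and keep experimenting until you get output you can live with.
As you pointed out, the input isn't regular enough to really take advantage of regular expressions. But you can at least hack at it. For data that dirty, there probably isn't going to be a truly perfect answer.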
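The same trial-and-error loop can be run locally instead of in an online tester. The following is only a rough sketch: the class RegexSampleCheck and its methods are invented for illustration, the samples are taken from the question, and the candidate pattern is the one from the other answer. It counts how many samples a candidate pattern splits exactly as expected, so you can swap in a new pattern and see whether the count goes up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSampleCheck {
    // Applies a candidate pattern and collects group 1 of every match.
    static List<String> extract(Pattern pattern, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    // Counts the samples whose extracted entries equal the expected entries.
    static int countPassing(Pattern pattern, Map<String, List<String>> samples) {
        int passing = 0;
        for (Map.Entry<String, List<String>> sample : samples.entrySet()) {
            if (extract(pattern, sample.getKey()).equals(sample.getValue())) {
                passing++;
            }
        }
        return passing;
    }

    public static void main(String[] args) {
        // Representative inputs from the question, mapped to the desired split.
        Map<String, List<String>> samples = new LinkedHashMap<>();
        samples.put("Fenner, Robert: Fishbume, Howard",
                List.of("Fenner, Robert", "Fishbume, Howard"));
        samples.put("Fendrich, Karen N., Ricci, Vincent",
                List.of("Fendrich, Karen N.", "Ricci, Vincent"));
        samples.put("A Lilly, Alisia Rudd, Andrew McComb, Daniel Lisbon, David Compton",
                List.of("A Lilly", "Alisia Rudd", "Andrew McComb", "Daniel Lisbon", "David Compton"));

        Pattern candidate = Pattern.compile("(\\w+(,|:|$)\\s*\\w+)(,|:|$)");
        System.out.println(countPassing(candidate, samples)
                + " of " + samples.size() + " samples pass");
    }
}
```

Keeping the expected splits in one place makes it obvious when a pattern change that fixes one kind of entry breaks another.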