I have a large list of names and need to extract the first-name initials and the last name from each one. Here are some sample names:
T.-P. SU
H. SPRONG
G. VAN MEER
C. PERRONE CAPANO
E. C. PARKER-ATHILL
R. J. BALICE-GORDON
D. B. VAZQUEZ SANROMAN
B. P. C. CHEN
J. P. BENNETT, Jr
T.-K. KUKKO-LUKJANOV
Expected output:
TP SU
H SPRONG
G VAN MEER
C PERRONE CAPANO
EC PARKER-ATHILL
RJ BALICE-GORDON
DB VAZQUEZ SANROMAN
JP BENNETT JR
TK KUKKO-LUKJANOV
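I am currently using the Split function to break them apart. Is there a better, regex-based way to parse them correctly? Please suggest one.
Thanks.
Answer 0 (score: 2)
The following regular expression works for your sample data: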
((?:[A-Z][-. ]+)+) ([- A-Z]+(?:, \w+)?)
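Example: http://www.rubular.com/r/cM87Prp2to
Group 1 captures the initials and group 2 captures the last name, including any suffix such as ", Jr". This produces the groups below; if any of them is not what you expect, please edit your question with more detail: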
T.-P. SU -> (T.-P.) (SU)
H. SPRONG -> (H.) (SPRONG)
G. VAN MEER -> (G.) (VAN MEER)
C. PERRONE CAPANO -> (C.) (PERRONE CAPANO)
E. C. PARKER-ATHILL -> (E. C.) (PARKER-ATHILL)
R. J. BALICE-GORDON -> (R. J.) (BALICE-GORDON)
D. B. VAZQUEZ SANROMAN -> (D. B.) (VAZQUEZ SANROMAN)
B. P. C. CHEN -> (B. P. C.) (CHEN)
J. P. BENNETT, Jr -> (J. P.) (BENNETT, Jr)
T.-K. KUKKO-LUKJANOV -> (T.-K.) (KUKKO-LUKJANOV)
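The groups still contain the dots, hyphens, and comma, so a little post-processing is needed to reach the output format asked for in the question. A minimal C# sketch of that step follows, assuming .NET's System.Text.RegularExpressions. The class name NameParser, the method name Format, the added ^ and $ anchors, and the choice to return non-matching input unchanged are all assumptions of this sketch, not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameParser
{
    // The pattern from the answer, with ^ and $ anchors added by this sketch
    // so that the whole line has to match.
    private static readonly Regex NamePattern =
        new Regex(@"^((?:[A-Z][-. ]+)+) ([- A-Z]+(?:, \w+)?)$");

    // Hypothetical helper: returns "INITIALS SURNAME".
    // Input that does not match the pattern is returned unchanged (trimmed).
    public static string Format(string name)
    {
        string trimmed = name.Trim();
        Match m = NamePattern.Match(trimmed);
        if (!m.Success)
        {
            return trimmed;
        }

        // Group 1: keep only the letters of the initials, e.g. "T.-P." -> "TP".
        string initials = Regex.Replace(m.Groups[1].Value, @"[^A-Z]", "");

        // Group 2: drop the comma before a suffix and upper-case the result,
        // e.g. "BENNETT, Jr" -> "BENNETT JR".
        string surname = m.Groups[2].Value.Replace(",", "").ToUpperInvariant();

        return initials + " " + surname;
    }
}
```

With this sketch, NameParser.Format("J. P. BENNETT, Jr") returns "JP BENNETT JR" and NameParser.Format("T.-P. SU") returns "TP SU". Hyphens inside the surname are preserved because only group 1 is stripped of punctuation.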
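Answer 1 (score: 1)
Here is my solution. My goal was not to provide the simplest solution, but one that accepts a wide variety of (sometimes odd) name formats and produces a best guess at the first-name and last-name initials, or a single initial for a person who goes by one name.
I also tried to write it in a reasonably international-friendly way by using Unicode regexes. I have no experience generating initials for many kinds of non-English names (Chinese, for example), but it should at least produce something usable to represent the person in two characters or fewer. For example, giving it a Korean name such as "행운의 복숭아" yields 행복, as you might expect (although that may not be the right way to do it in Korean culture).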
/// <summary>
/// Given a person's first and last name, we'll make our best guess to extract up to two initials, hopefully
/// representing their first and last name, skipping any middle initials, Jr/Sr/III suffixes, etc. The letters
/// will be returned together in ALL CAPS, e.g. "TW".
///
/// The way it parses names for many common styles:
///
/// Mason Zhwiti -> MZ
/// mason lowercase zhwiti -> MZ
/// Mason G Zhwiti -> MZ
/// Mason G. Zhwiti -> MZ
/// John Queue Public -> JP
/// John Q. Public, Jr. -> JP
/// John Q Public Jr. -> JP
/// Thurston Howell III -> TH
/// Thurston Howell, III -> TH
/// Malcolm X -> MX
/// A Ron -> AR
/// A A Ron -> AR
/// Madonna -> M
/// Chris O'Donnell -> CO
/// Malcolm McDowell -> MM
/// Robert "Rocky" Balboa, Sr. -> RB
/// 1Bobby 2Tables -> BT
/// Éric Ígor -> ÉÍ
/// 행운의 복숭아 -> 행복
///
/// </summary>
/// <param name="name">The full name of a person.</param>
/// <returns>One to two uppercase initials, without punctuation.</returns>
public static string ExtractInitialsFromName(string name)
{
    // Requires: using System.Text.RegularExpressions;

    // First remove all punctuation, symbols, control/other chars, and numbers (Unicode categories P, S, C, N).
    string initials = Regex.Replace(name, @"[\p{P}\p{S}\p{C}\p{N}]+", "");

    // Replace every run of Unicode separator characters (\p{Z}) with a single, regular ASCII space.
    initials = Regex.Replace(initials, @"\p{Z}+", " ");

    // Remove Sr, Jr, I, II, III, IV, V, VI, VII, VIII, IX at the end of the name.
    initials = Regex.Replace(initials.Trim(), @"\s+(?:[JS]R|I{1,3}|I[VX]|VI{0,3})$", "", RegexOptions.IgnoreCase);

    // Extract up to 2 initials from the remaining cleaned name.
    initials = Regex.Replace(initials, @"^(\p{L})[^\s]*(?:\s+(?:\p{L}+\s+(?=\p{L}))?(?:(\p{L})\p{L}*)?)?$", "$1$2").Trim();

    if (initials.Length > 2)
    {
        // Worst case scenario, everything failed, just grab the first two letters of what we have left.
        initials = initials.Substring(0, 2);
    }

    return initials.ToUpperInvariant();
}
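To see what each replacement contributes, it can help to record the string after every step. The sketch below applies the same four patterns in the same order and collects the intermediate results. The class name InitialsTrace and the method name Steps are hypothetical names introduced only for this illustration, and the final length check and upper-casing are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InitialsTrace
{
    // Returns the string as it looks after each of the four Regex.Replace steps.
    public static List<string> Steps(string name)
    {
        var steps = new List<string>();

        // Step 1: strip punctuation, symbols, control/other chars, and numbers.
        string s = Regex.Replace(name, @"[\p{P}\p{S}\p{C}\p{N}]+", "");
        steps.Add(s);

        // Step 2: collapse runs of Unicode separators into one ASCII space.
        s = Regex.Replace(s, @"\p{Z}+", " ");
        steps.Add(s);

        // Step 3: drop a trailing Jr/Sr or Roman-numeral suffix.
        s = Regex.Replace(s.Trim(), @"\s+(?:[JS]R|I{1,3}|I[VX]|VI{0,3})$", "", RegexOptions.IgnoreCase);
        steps.Add(s);

        // Step 4: keep the first letter of the first word and of the last word.
        s = Regex.Replace(s, @"^(\p{L})[^\s]*(?:\s+(?:\p{L}+\s+(?=\p{L}))?(?:(\p{L})\p{L}*)?)?$", "$1$2").Trim();
        steps.Add(s);

        return steps;
    }
}
```

For the input "John Q. Public, Jr." the four recorded values are "John Q Public Jr", "John Q Public Jr", "John Q Public", and "JP". Step 2 changes nothing here because the input already uses single ASCII spaces.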