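Does anyone have a trusted Proper Case or PCase algorithm (similar to UCase or Upper)? I'm looking for something that takes a value such as "GEORGE BURDELL" or "george burdell" and turns it into "George Burdell".
I have a simple one that handles the simple cases. Ideally it would also handle things such as "O'REILLY" and turn it into "O'Reilly", but I know that is tougher.
I am mainly focused on English, if that simplifies things.
UPDATE: I'm using C# as the language, but I can convert from almost anything (assuming equivalent functionality exists).
I agree that the McDonald's scenario is a tough one. I meant to mention it along with my O'Reilly example, but did not in the original post.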
Answer 0 (score: 18)
Unless I've misunderstood your question, I don't think you need to roll your own; the TextInfo class can do it for you.
using System.Globalization;
CultureInfo.InvariantCulture.TextInfo.ToTitleCase("GeOrGE bUrdEll")
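will return "George Burdell". You can use your own culture if some special rules are involved.
Update: Michael (in a comment on this answer) pointed out that this will not work if the input is all caps, since the method will assume that it is an acronym. The naive workaround is to call .ToLower() on the text before passing it to ToTitleCase.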
Answer 1 (score: 9)
@Zack: I'll post it as a separate reply.
Here's an example based on kronoz's post.
void Main()
{
    List<string> names = new List<string>() {
        "bill o'reilly",
        "johannes diderik van der waals",
        "mr. moseley-williams",
        "Joe VanWyck",
        "mcdonald's",
        "william the third",
        "hrh prince charles",
        "h.r.m. queen elizabeth the third",
        "william gates, iii",
        "pope leo xii",
        "a.k. jennings"
    };
    names.Select(name => name.ToProperCase()).Dump();
}
// Define other methods and classes here
// http://stackoverflow.com/questions/32149/does-anyone-have-a-good-proper-case-algorithm
public static class ProperCaseHelper {
    public static string ToProperCase(this string input) {
        if (IsAllUpperOrAllLower(input))
        {
            // fix the ALL UPPERCASE or all lowercase names
            return string.Join(" ", input.Split(' ').Select(word => wordToProperCase(word)));
        }
        else
        {
            // leave the CamelCase or Propercase names alone
            return input;
        }
    }
    public static bool IsAllUpperOrAllLower(this string input) {
        return (input.ToLower().Equals(input) || input.ToUpper().Equals(input) );
    }
    private static string wordToProperCase(string word) {
        if (string.IsNullOrEmpty(word)) return word;
        // Standard case
        string ret = capitaliseFirstLetter(word);
        // Special cases:
        ret = properSuffix(ret, "'");   // D'Artagnon, D'Silva
        ret = properSuffix(ret, ".");   // ???
        ret = properSuffix(ret, "-");   // Oscar-Meyer-Weiner
        ret = properSuffix(ret, "Mc");  // Scots
        ret = properSuffix(ret, "Mac"); // Scots
        // Special words:
        ret = specialWords(ret, "van");    // Dick van Dyke
        ret = specialWords(ret, "von");    // Baron von Bruin-Valt
        ret = specialWords(ret, "de");
        ret = specialWords(ret, "di");
        ret = specialWords(ret, "da");     // Leonardo da Vinci, Eduardo da Silva
        ret = specialWords(ret, "of");     // The Grand Old Duke of York
        ret = specialWords(ret, "the");    // William the Conqueror
        ret = specialWords(ret, "HRH");    // His/Her Royal Highness
        ret = specialWords(ret, "HRM");    // His/Her Royal Majesty
        ret = specialWords(ret, "H.R.H."); // His/Her Royal Highness
        ret = specialWords(ret, "H.R.M."); // His/Her Royal Majesty
        ret = dealWithRomanNumerals(ret);  // William Gates, III
        return ret;
    }
    private static string properSuffix(string word, string prefix) {
        if(string.IsNullOrEmpty(word)) return word;
        string lowerWord = word.ToLower();
        string lowerPrefix = prefix.ToLower();
        if (!lowerWord.Contains(lowerPrefix)) return word;
        int index = lowerWord.IndexOf(lowerPrefix);
        // If the search string is at the end of the word ignore.
        if (index + prefix.Length == word.Length) return word;
        return word.Substring(0, index) + prefix +
            capitaliseFirstLetter(word.Substring(index + prefix.Length));
    }
    private static string specialWords(string word, string specialWord)
    {
        if(word.Equals(specialWord, StringComparison.InvariantCultureIgnoreCase))
        {
            return specialWord;
        }
        else
        {
            return word;
        }
    }
    private static string dealWithRomanNumerals(string word)
    {
        List<string> ones = new List<string>() { "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX" };
        List<string> tens = new List<string>() { "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC", "C" };
        // assume nobody uses hundreds
        foreach (string number in ones)
        {
            if (word.Equals(number, StringComparison.InvariantCultureIgnoreCase))
            {
                return number;
            }
        }
        foreach (string ten in tens)
        {
            foreach (string one in ones)
            {
                if (word.Equals(ten + one, StringComparison.InvariantCultureIgnoreCase))
                {
                    return ten + one;
                }
            }
        }
        return word;
    }
    private static string capitaliseFirstLetter(string word) {
        return char.ToUpper(word[0]) + word.Substring(1).ToLower();
    }
}
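One detail in dealWithRomanNumerals above: a word is only recognised if it is a bare ones value or a tens value followed by a ones value, so a bare tens value such as "xx" or "xl" falls through and stays "Xx" or "Xl". Below is a minimal Python sketch of a set-based lookup that also accepts bare tens values. The names ROMAN_NUMERALS and fix_roman_numeral are made up for this illustration, and the covered range (1 to 109) simply mirrors the two lists in the answer.

```python
# Hypothetical sketch, not the answer's code: build every numeral from the
# same "ones" and "tens" lists, but allow either part to be empty.
ONES = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]
TENS = ["", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC", "C"]

# All combinations except the empty string, i.e. the numerals for 1..109.
ROMAN_NUMERALS = {ten + one for ten in TENS for one in ONES} - {""}


def fix_roman_numeral(word):
    """Return the word in upper case if it spells a known Roman numeral."""
    upper = word.upper()
    return upper if upper in ROMAN_NUMERALS else word


print(fix_roman_numeral("Xx"))     # XX
print(fix_roman_numeral("Xii"))    # XII
print(fix_roman_numeral("Gates"))  # Gates
```

Like the original, this has false positives for short names that happen to spell a numeral: "Li" or "Vi" would come back as "LI" or "VI", so a real implementation would probably want to look at the word's position in the name.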
Answer 2 (score: 4)
There's also this neat Perl script for title-casing text.
http://daringfireball.net/2008/08/title_case_update
#!/usr/bin/perl

# This filter changes all words to Title Caps, and attempts to be clever
# about *un*capitalizing small words like a/an/the in the input.
#
# The list of "small words" which are not capped comes from
# the New York Times Manual of Style, plus 'vs' and 'v'.
#
# 10 May 2008
# Original version by John Gruber:
# http://daringfireball.net/2008/05/title_case
#
# 28 July 2008
# Re-written and much improved by Aristotle Pagaltzis:
# http://plasmasturm.org/code/titlecase/
#
# Full change log at __END__.
#
# License: http://www.opensource.org/licenses/mit-license.php
#

use strict;
use warnings;
use utf8;
use open qw( :encoding(UTF-8) :std );

my @small_words = qw( (?<!q&)a an and as at(?!&t) but by en for if in of on or the to v[.]? via vs[.]? );
my $small_re = join '|', @small_words;

my $apos = qr/ (?: ['’] [[:lower:]]* )? /x;

while ( <> ) {
    s{\A\s+}{}, s{\s+\z}{};

    $_ = lc $_ if not /[[:lower:]]/;

    s{
        \b (_*) (?:
            ( (?<=[ ][/\\]) [[:alpha:]]+ [-_[:alpha:]/\\]+ |   # file path or
              [-_[:alpha:]]+ [@.:] [-_[:alpha:]@.:/]+ $apos )  # URL, domain, or email
            |
            ( (?i: $small_re ) $apos )                         # or small word (case-insensitive)
            |
            ( [[:alpha:]] [[:lower:]'’()\[\]{}]* $apos )       # or word w/o internal caps
            |
            ( [[:alpha:]] [[:alpha:]'’()\[\]{}]* $apos )       # or some other word
        ) (_*) \b
    }{
        $1 . (
          defined $2 ? $2         # preserve URL, domain, or email
        : defined $3 ? "\L$3"     # lowercase small word
        : defined $4 ? "\u\L$4"   # capitalize word w/o internal caps
        : $5                      # preserve other kinds of word
        ) . $6
    }xeg;

    # Exceptions for small words: capitalize at start and end of title
    s{
        (  \A [[:punct:]]*         # start of title...
        |  [:.;?!][ ]+             # or of subsentence...
        |  [ ]['"“‘(\[][ ]*     )  # or of inserted subphrase...
        ( $small_re ) \b           # ... followed by small word
    }{$1\u\L$2}xig;

    s{
        \b ( $small_re )      # small word...
        (?= [[:punct:]]* \Z   # ... at the end of the title...
        |   ['"’”)\]] [ ] )   # ... or of an inserted subphrase?
    }{\u\L$1}xig;

    # Exceptions for small words in hyphenated compound words
    ## e.g. "in-flight" -> In-Flight
    s{
        \b
        (?<! -)                 # Negative lookbehind for a hyphen; we don't want to match man-in-the-middle but do want (in-flight)
        ( $small_re )
        (?= -[[:alpha:]]+)      # lookahead for "-someword"
    }{\u\L$1}xig;

    ## # e.g. "Stand-in" -> "Stand-In" (Stand is already capped at this point)
    s{
        \b
        (?<!…)                  # Negative lookbehind for a hyphen; we don't want to match man-in-the-middle but do want (stand-in)
        ( [[:alpha:]]+- )       # $1 = first word and hyphen, should already be properly capped
        ( $small_re )           # ... followed by small word
        (?! - )                 # Negative lookahead for another '-'
    }{$1\u$2}xig;

    print "$_";
}

__END__
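But it sounds like by proper case you mean... for people's names only.
Answer 3 (score: 2)
I wrote this today to implement in an app I'm working on. I think the code is fairly self-explanatory with its comments. It is not 100% accurate in all cases, but it handles most Western names easily.
Examples: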
mary-jane => Mary-Jane
o'brien => O'Brien
Joël VON WINTEREGG => Joël von Winteregg
jose de la acosta => Jose de la Acosta
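The code is extensible in that you can add any string value to the arrays at the top to suit your needs. Please study it and add any special handling you may require.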
function name_title_case($str)
{
    // name parts that should be lowercase in most cases
    $ok_to_be_lower = array('av','af','da','dal','de','del','der','di','la','le','van','den','vel','von');
    // name parts that should be lower even if at the beginning of a name
    $always_lower = array('van', 'der');
    // Initialise the working arrays so the loops below never touch undefined variables
    $rules = array();
    $c13n = array();
    // Create an array from the parts of the string passed in
    $parts = explode(" ", mb_strtolower($str));
    foreach ($parts as $part)
    {
        $rules[$part] = in_array($part, $ok_to_be_lower) ? 'nocaps' : 'caps';
    }
    // Determine the first part in the string
    reset($rules);
    $first_part = key($rules);
    // Loop through and cap-or-dont-cap
    foreach ($rules as $part => $rule)
    {
        if ($rule == 'caps')
        {
            // ucfirst() words and also takes into account apostrophes and hyphens like this:
            // O'brien -> O'Brien || mary-kaye -> Mary-Kaye
            $part = str_replace('- ','-',ucwords(str_replace('-','- ', $part)));
            $c13n[] = str_replace('\' ', '\'', ucwords(str_replace('\'', '\' ', $part)));
        }
        else if ($part == $first_part && !in_array($part, $always_lower))
        {
            // If the first part of the string is ok_to_be_lower, cap it anyway
            $c13n[] = ucfirst($part);
        }
        else
        {
            $c13n[] = $part;
        }
    }
    $titleized = implode(' ', $c13n);
    return trim($titleized);
}
Answer 4 (score: 2)
I did a quick C# port of https://github.com/tamtamchik/namecase, which is based on Lingua::EN::NameCase.
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CIQNameCase
{
    static Dictionary<string, string> _exceptions = new Dictionary<string, string>
    {
        {@"\bMacEdo"     ,"Macedo"},
        {@"\bMacEvicius" ,"Macevicius"},
        {@"\bMacHado"    ,"Machado"},
        {@"\bMacHar"     ,"Machar"},
        {@"\bMacHin"     ,"Machin"},
        {@"\bMacHlin"    ,"Machlin"},
        {@"\bMacIas"     ,"Macias"},
        {@"\bMacIulis"   ,"Maciulis"},
        {@"\bMacKie"     ,"Mackie"},
        {@"\bMacKle"     ,"Mackle"},
        {@"\bMacKlin"    ,"Macklin"},
        {@"\bMacKmin"    ,"Mackmin"},
        {@"\bMacQuarie"  ,"Macquarie"}
    };
    static Dictionary<string, string> _replacements = new Dictionary<string, string>
    {
        {@"\bAl(?=\s+\w)"         , @"al"},     // al Arabic or forename Al.
        {@"\b(Bin|Binti|Binte)\b" , @"bin"},    // bin, binti, binte Arabic
        {@"\bAp\b"                , @"ap"},     // ap Welsh.
        {@"\bBen(?=\s+\w)"        , @"ben"},    // ben Hebrew or forename Ben.
        {@"\bDell([ae])\b"        , @"dell$1"}, // della and delle Italian.
        {@"\bD([aeiou])\b"        , @"d$1"},    // da, de, di Italian; du French; do Brasil
        {@"\bD([ao]s)\b"          , @"d$1"},    // das, dos Brasileiros
        {@"\bDe([lrn])\b"         , @"de$1"},   // del Italian; der/den Dutch/Flemish.
        {@"\bEl\b"                , @"el"},     // el Greek or El Spanish.
        {@"\bLa\b"                , @"la"},     // la French or La Spanish.
        {@"\bL([eo])\b"           , @"l$1"},    // lo Italian; le French.
        {@"\bVan(?=\s+\w)"        , @"van"},    // van Dutch/Flemish or forename Van.
        {@"\bVon\b"               , @"von"}     // von German
    };
    static string[] _conjunctions = { "Y", "E", "I" };
    static string _romanRegex = @"\b((?:[Xx]{1,3}|[Xx][Ll]|[Ll][Xx]{0,3})?(?:[Ii]{1,3}|[Ii][VvXx]|[Vv][Ii]{0,3})?)\b";
    /// <summary>
    /// Case a name field into its appropriate case format
    /// e.g. Smith, de la Cruz, Mary-Jane, O'Brien, McTaggart
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    public static string NameCase(string nameString)
    {
        // Capitalize
        nameString = Capitalize(nameString);
        nameString = UpdateIrish(nameString);
        // Fixes for "son (daughter) of" etc
        foreach (var replacement in _replacements.Keys)
        {
            if (Regex.IsMatch(nameString, replacement))
            {
                Regex rgx = new Regex(replacement);
                nameString = rgx.Replace(nameString, _replacements[replacement]);
            }
        }
        nameString = UpdateRoman(nameString);
        nameString = FixConjunction(nameString);
        return nameString;
    }
    /// <summary>
    /// Capitalize first letters.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string Capitalize(string nameString)
    {
        nameString = nameString.ToLower();
        nameString = Regex.Replace(nameString, @"\b\w", x => x.ToString().ToUpper());
        nameString = Regex.Replace(nameString, @"'\w\b", x => x.ToString().ToLower()); // Lowercase 's
        return nameString;
    }
    /// <summary>
    /// Update for Irish names.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string UpdateIrish(string nameString)
    {
        if(Regex.IsMatch(nameString, @".*?\bMac[A-Za-z^aciozj]{2,}\b") || Regex.IsMatch(nameString, @".*?\bMc"))
        {
            nameString = UpdateMac(nameString);
        }
        return nameString;
    }
    /// <summary>
    /// Updates Irish Mac & Mc.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string UpdateMac(string nameString)
    {
        MatchCollection matches = Regex.Matches(nameString, @"\b(Ma?c)([A-Za-z]+)");
        if(matches.Count == 1 && matches[0].Groups.Count == 3)
        {
            string replacement = matches[0].Groups[1].Value;
            replacement += matches[0].Groups[2].Value.Substring(0, 1).ToUpper();
            replacement += matches[0].Groups[2].Value.Substring(1);
            nameString = nameString.Replace(matches[0].Groups[0].Value, replacement);
            // Now fix "Mac" exceptions
            foreach (var exception in _exceptions.Keys)
            {
                nameString = Regex.Replace(nameString, exception, _exceptions[exception]);
            }
        }
        return nameString;
    }
    /// <summary>
    /// Fix roman numeral names.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string UpdateRoman(string nameString)
    {
        MatchCollection matches = Regex.Matches(nameString, _romanRegex);
        if (matches.Count > 1)
        {
            foreach(Match match in matches)
            {
                if(!string.IsNullOrEmpty(match.Value))
                {
                    nameString = Regex.Replace(nameString, match.Value, x => x.ToString().ToUpper());
                }
            }
        }
        return nameString;
    }
    /// <summary>
    /// Fix Spanish conjunctions.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string FixConjunction(string nameString)
    {
        foreach (var conjunction in _conjunctions)
        {
            nameString = Regex.Replace(nameString, @"\b" + conjunction + @"\b", x => x.ToString().ToLower());
        }
        return nameString;
    }
}
Usage:
string name_cased = CIQNameCase.NameCase("McCarthy");
And this is my test method; everything seems to pass OK:
[TestMethod]
public void Test_NameCase_1()
{
    string[] names = {
        "Keith", "Yuri's", "Leigh-Williams", "McCarthy",
        // Mac exceptions
        "Machin", "Machlin", "Machar",
        "Mackle", "Macklin", "Mackie",
        "Macquarie", "Machado", "Macevicius",
        "Maciulis", "Macias", "MacMurdo",
        // General
        "O'Callaghan", "St. John", "von Streit",
        "van Dyke", "Van", "ap Llwyd Dafydd",
        "al Fahd", "Al",
        "el Grecco",
        "ben Gurion", "Ben",
        "da Vinci",
        "di Caprio", "du Pont", "de Legate",
        "del Crond", "der Sind", "van der Post", "van den Thillart",
        "von Trapp", "la Poisson", "le Figaro",
        "Mack Knife", "Dougal MacDonald",
        "Ruiz y Picasso", "Dato e Iradier", "Mas i Gavarró",
        // Roman numerals
        "Henry VIII", "Louis III", "Louis XIV",
        "Charles II", "Fred XLIX", "Yusof bin Ishak",
    };
    foreach(string name in names)
    {
        string name_upper = name.ToUpper();
        string name_cased = CIQNameCase.NameCase(name_upper);
        Console.WriteLine(string.Format("name: {0} -> {1} -> {2}", name, name_upper, name_cased));
        Assert.IsTrue(name == name_cased);
    }
}
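Answer 5 (score: 1)
What programming language do you use? Many languages allow callback functions for regular expression matches. These can be used to proper-case each match easily. The regular expression is quite simple; you just have to match all word characters, like so:
/\w+/
Alternatively, you can extract the first character as a separate group:
/(\w)(\w*)/
Now you can access the first character and the remaining characters of the match separately. The callback function can then simply return the concatenation of the two groups. In pseudo-Python (I don't actually know Python):
def make_proper(match):
    return match[1].upper() + match[2]
Incidentally, this also handles the "O'Reilly" case, because "O" and "Reilly" are matched separately and both proper-cased. However, the algorithm does not handle other special cases well, for example "McDonald's" or, more generally, any word with an apostrophe: it would produce "Mcdonald'S" for the former. Special handling for apostrophes could be implemented, but that would interfere with the first case. A theoretically perfect solution isn't possible. In practice, it may help to consider the length of the part after the apostrophe.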
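The callback approach and the apostrophe-length idea above can be sketched in real Python. This is only an illustration under assumptions: the function name proper_case and the threshold of three letters (min_tail) are made up here, and only the straight ASCII apostrophe is considered.

```python
import re

# Runs of letters (Unicode-aware), excluding digits and underscores.
_LETTERS = re.compile(r"[^\W\d_]+")


def proper_case(text, min_tail=3):
    """Capitalise each run of letters using a regex callback.

    A run that directly follows "letter + apostrophe" and is shorter than
    min_tail letters (an assumed threshold) is treated as a suffix such as
    the "s" in "McDonald's" and is lowercased instead.
    """
    def fix(match):
        word = match.group(0)
        start = match.start()
        follows_apostrophe = (
            start >= 2
            and text[start - 1] == "'"
            and text[start - 2].isalpha()
        )
        if follows_apostrophe and len(word) < min_tail:
            return word.lower()
        return word[0].upper() + word[1:].lower()

    return _LETTERS.sub(fix, text)


print(proper_case("O'REILLY"))    # O'Reilly
print(proper_case("mcdonald's"))  # Mcdonald's
```

The length check keeps "O'Reilly" and "Fred's" apart, but it does nothing for the "Mc" prefix, and a short surname part after an apostrophe would be lowercased by mistake, so it stays a heuristic.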
Answer 6 (score: 1)
This is perhaps a naive C# implementation:
public class ProperCaseHelper {
    public string ToProperCase(string input) {
        string ret = string.Empty;
        var words = input.Split(' ');
        for (int i = 0; i < words.Length; ++i) {
            ret += wordToProperCase(words[i]);
            if (i < words.Length - 1) ret += " ";
        }
        return ret;
    }
    private string wordToProperCase(string word) {
        if (string.IsNullOrEmpty(word)) return word;
        // Standard case
        string ret = capitaliseFirstLetter(word);
        // Special cases:
        ret = properSuffix(ret, "'");
        ret = properSuffix(ret, ".");
        ret = properSuffix(ret, "Mc");
        ret = properSuffix(ret, "Mac");
        return ret;
    }
    private string properSuffix(string word, string prefix) {
        if(string.IsNullOrEmpty(word)) return word;
        string lowerWord = word.ToLower(), lowerPrefix = prefix.ToLower();
        if (!lowerWord.Contains(lowerPrefix)) return word;
        int index = lowerWord.IndexOf(lowerPrefix);
        // If the search string is at the end of the word ignore.
        if (index + prefix.Length == word.Length) return word;
        return word.Substring(0, index) + prefix +
            capitaliseFirstLetter(word.Substring(index + prefix.Length));
    }
    private string capitaliseFirstLetter(string word) {
        return char.ToUpper(word[0]) + word.Substring(1).ToLower();
    }
}
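Answer 7 (score: 0)
A simple way to capitalise the first letter of each word (separated by a space):
$words = explode(" ", $string);
$result = '';
for ($i = 0; $i < count($words); $i++) {
    $s = strtolower($words[$i]);
    $s = substr_replace($s, strtoupper(substr($s, 0, 1)), 0, 1);
    $result .= "$s ";
}
$string = trim($result);
As for catching the "O'REILLY" example you gave: splitting the string on both spaces and apostrophes would not work, because it would capitalise any letter that appears after an apostrophe, e.g. the s in Fred's.
So I would probably try something like:
$words = explode(" ", $string);
$result = '';
for ($i = 0; $i < count($words); $i++) {
    $s = strtolower($words[$i]);
    if (substr($s, 0, 2) === "o'") {
        // Uppercase the "O", the apostrophe and the letter that follows it
        $s = substr_replace($s, strtoupper(substr($s, 0, 3)), 0, 3);
    } else {
        $s = substr_replace($s, strtoupper(substr($s, 0, 1)), 0, 1);
    }
    $result .= "$s ";
}
$string = trim($result);
This should catch O'Reilly, O'Clock, O'Donnell, etc. Hope it helps.
Please note that this code is untested.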
Answer 8 (score: 0)
Kronoz, thank you. I found that this line in your function:
if (!lowerWord.Contains(lowerPrefix)) return word;
must say
if (!lowerWord.StartsWith(lowerPrefix)) return word;
so that "información" is not changed to "InforMacIón".
Best,
Enrique
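To make the difference between the two checks concrete, here is a small Python sketch that mirrors the properSuffix logic with each variant. The function names are hypothetical and this is not a port of Kronoz's code, only an illustration of the behaviour described above.

```python
def capitalise_first(word):
    """Uppercase the first character and lowercase the rest."""
    return word[:1].upper() + word[1:].lower()


def fix_anywhere(word, marker):
    """Contains-style check: the marker may appear anywhere in the word."""
    index = word.lower().find(marker.lower())
    if index == -1 or index + len(marker) == len(word):
        return word
    rest = word[index + len(marker):]
    return word[:index] + marker + capitalise_first(rest)


def fix_at_start(word, marker):
    """StartsWith-style check: the marker must open the word."""
    if not word.lower().startswith(marker.lower()) or len(marker) == len(word):
        return word
    return marker + capitalise_first(word[len(marker):])


print(fix_anywhere("Información", "Mac"))  # InforMacIón
print(fix_at_start("Información", "Mac"))  # Información
print(fix_at_start("Macdonald", "Mac"))    # MacDonald
print(fix_anywhere("O'reilly", "'"))       # O'Reilly
print(fix_at_start("O'reilly", "'"))       # O'reilly
```

As the last two lines suggest, the start-only check suits the "Mc" and "Mac" prefixes, while the apostrophe, period and hyphen markers sit inside the word and would still need the anywhere-style check, so the two kinds of marker probably want separate handling.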
Answer 9 (score: 0)
I use this as the TextChanged event handler for text boxes. It supports entry of "McDonald".
Public Shared Function DoProperCaseConvert(ByVal str As String, Optional ByVal allowCapital As Boolean = True) As String
    Dim strCon As String = ""
    Dim wordbreak As String = " ,.1234567890;/\-()#$%^&*€!~+=@"
    Dim nextShouldBeCapital As Boolean = True
    'Improve to recognize all caps input
    'If str.Equals(str.ToUpper) Then
    '    str = str.ToLower
    'End If
    For Each s As Char In str.ToCharArray
        If allowCapital Then
            'Keep capitals the user typed inside a word (e.g. the D in McDonald)
            strCon = strCon & If(nextShouldBeCapital, s.ToString.ToUpper, s.ToString)
        Else
            strCon = strCon & If(nextShouldBeCapital, s.ToString.ToUpper, s.ToString.ToLower)
        End If
        If wordbreak.Contains(s.ToString) Then
            nextShouldBeCapital = True
        Else
            nextShouldBeCapital = False
        End If
    Next
    Return strCon
End Function
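Answer 10 (score: 0)
A lot of good answers here. Mine is pretty simple and only takes into account the names we have in our organization. You can expand it as you wish. It is not a perfect solution: it will change vancouver to VanCouver, which is wrong. So tweak it if you use it.
Here is my solution in C#. It hard-codes the names into the program, but with a little work you could keep a text file outside the program, read in the name exceptions (i.e. Van, Mc, Mac) and loop through them.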
// Requires: using System.Text.RegularExpressions;
public static String toProperName(String name)
{
    if (name != null)
    {
        if (name.Length >= 2 && name.ToLower().Substring(0, 2) == "mc") // Changes mcdonald to "McDonald"
            return "Mc" + Regex.Replace(name.ToLower().Substring(2), @"\b[a-z]", m => m.Value.ToUpper());
        if (name.Length >= 3 && name.ToLower().Substring(0, 3) == "van") // Changes vanwinkle to "VanWinkle"
            return "Van" + Regex.Replace(name.ToLower().Substring(3), @"\b[a-z]", m => m.Value.ToUpper());
        return Regex.Replace(name.ToLower(), @"\b[a-z]", m => m.Value.ToUpper()); // Changes to title case but also fixes
                                                                                  // apostrophes like O'HARE or o'hare to O'Hare
    }
    return "";
}
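One possible way to tame the vancouver problem is to check a table of known whole names before applying the prefix rules. The Python sketch below is only an illustration of that idea: KNOWN_NAMES and its single entry are hypothetical, the prefixes are the two from the code above, and in practice both tables could be read from a file outside the program as the answer suggests.

```python
import re

# Prefixes taken from the answer; the whole-name table is a made-up example.
PREFIXES = ("Mc", "Van")
KNOWN_NAMES = {"vancouver": "Vancouver"}


def _title(text):
    """Uppercase every ASCII letter that starts a word, like \\b[a-z] in the C#."""
    return re.sub(r"\b[a-z]", lambda m: m.group(0).upper(), text)


def to_proper_name(name):
    if name is None:
        return ""
    lowered = name.lower()
    # Whole-name exceptions win over the prefix rules.
    if lowered in KNOWN_NAMES:
        return KNOWN_NAMES[lowered]
    for prefix in PREFIXES:
        if lowered.startswith(prefix.lower()):
            return prefix + _title(lowered[len(prefix):])
    return _title(lowered)


print(to_proper_name("vancouver"))  # Vancouver
print(to_proper_name("VANWINKLE"))  # VanWinkle
print(to_proper_name("o'hare"))     # O'Hare
```

The exception table only helps for names you already know about, so it trades one kind of maintenance (tweaking code) for another (curating the list).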
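Answer 11 (score: 0)
I know this thread has been open for a while, but while researching this problem I came across this nifty site, which lets you paste in names and have them capitalized quite quickly: https://dialect.ca/code/name-case/. I wanted to include it here as a reference for others doing similar research or projects.
They publish the algorithm, written in PHP, at this link: https://dialect.ca/code/name-case/name_case.phps
A preliminary test and a read through their code suggest they have been quite thorough.
Answer 12 (score: -1)
You didn't mention which language you want the solution in, so here is some pseudocode.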
Loop through each character
    If the previous character was an alphabet letter
        Make the character lower case
    Otherwise
        Make the character upper case
End loop
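For reference, the pseudocode above translates almost line for line into Python. The function name simple_proper_case is made up for this sketch.

```python
def simple_proper_case(text):
    """Uppercase a character unless the previous character was a letter."""
    result = []
    previous_was_letter = False
    for ch in text:
        result.append(ch.lower() if previous_was_letter else ch.upper())
        previous_was_letter = ch.isalpha()
    return "".join(result)


print(simple_proper_case("GEORGE BURDELL"))  # George Burdell
print(simple_proper_case("o'reilly"))        # O'Reilly
print(simple_proper_case("mcdonald's"))      # Mcdonald'S
```

It gets "O'Reilly" right for free, but as the last line shows it also capitalises the "s" after any apostrophe and knows nothing about prefixes such as "Mc".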