I am trying to compare names with all punctuation, spaces, accents, and so on removed. At the moment I am doing the following:
-(NSString*) prepareString:(NSString*)a {
//remove any accents and punctuation;
a=[[[NSString alloc] initWithData:[a dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES] encoding:NSASCIIStringEncoding] autorelease];
a=[a stringByReplacingOccurrencesOfString:@" " withString:@""];
a=[a stringByReplacingOccurrencesOfString:@"'" withString:@""];
a=[a stringByReplacingOccurrencesOfString:@"`" withString:@""];
a=[a stringByReplacingOccurrencesOfString:@"-" withString:@""];
a=[a stringByReplacingOccurrencesOfString:@"_" withString:@""];
a=[a lowercaseString];
return a;
}
However, I need to do this for hundreds of strings, so I need to make it more efficient. Any ideas?
Answer 0 (score: 80)
NSString* finish = [[start componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];
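Answer 1 (score: 39)
Before using any of these solutions, don't forget to use decomposedStringWithCanonicalMapping to decompose any accented letters. This will turn, for example, é (U+00E9) into e followed by a combining acute accent (U+0065 U+0301). Then, when you strip out the non-alphanumeric characters, the unaccented letters will remain.
The reason this matters is that you probably don't want, say, “dän” and “dün”* to be treated as the same. If you stripped out all accented letters, as some of these solutions may do, you would end up with “dn” for both, so the strings would compare as equal.
So you should decompose them first, so that you can strip the accents and leave the letters.
*Example from German. Thanks to Joris Weimar for providing it.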
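The decompose-then-strip idea above can be sketched roughly as follows. The helper name StripDiacritics is hypothetical, and the sketch assumes ARC. One caveat worth checking against the documentation: Apple describes letterCharacterSet as covering Unicode categories L* and M*, so combining marks would survive a letters-only filter; this sketch therefore removes the marks explicitly with nonBaseCharacterSet (category M*).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original answers): decompose the string,
// then drop the combining marks so that only the base letters remain.
static NSString *StripDiacritics(NSString *input) {
    // "ä" (U+00E4) becomes "a" + U+0308 (combining diaeresis).
    NSString *decomposed = [input decomposedStringWithCanonicalMapping];
    // nonBaseCharacterSet covers Unicode general category M* (combining marks).
    NSCharacterSet *marks = [NSCharacterSet nonBaseCharacterSet];
    return [[decomposed componentsSeparatedByCharactersInSet:marks]
               componentsJoinedByString:@""];
}
```

With this, “dän” and “dün” keep their base vowels and stay distinguishable instead of both collapsing to “dn”.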
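Answer 2 (score: 14)
On a similar question, Ole Begemann suggests using stringByFoldingWithOptions:locale:, and I believe this is the best solution here:
NSString *accentedString = @"ÁlgeBra";
NSString *unaccentedString = [accentedString stringByFoldingWithOptions:NSDiacriticInsensitiveSearch locale:[NSLocale currentLocale]];
Depending on the nature of the strings you want to convert, you might want to set a fixed locale (e.g. English) instead of using the user's current locale. That way, you can be sure to get the same results on every machine.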
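As a minimal sketch of the fixed-locale variant, the helper below folds both diacritics and case in one call. The name FoldName and the choice of en_US_POSIX as the fixed locale are assumptions made for illustration; any stable locale identifier would serve the same purpose.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: fold diacritics and case using a fixed locale so the
// result does not depend on the user's settings.
static NSString *FoldName(NSString *name) {
    // Assumption: en_US_POSIX is used here only as an example of a fixed locale.
    NSLocale *fixedLocale = [NSLocale localeWithLocaleIdentifier:@"en_US_POSIX"];
    NSStringCompareOptions options = NSDiacriticInsensitiveSearch | NSCaseInsensitiveSearch;
    return [name stringByFoldingWithOptions:options locale:fixedLocale];
}
```

Folding does not remove punctuation or spaces, so it would still need to be combined with a character-set filter if those must go too.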
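Answer 3 (score: 7)
One important clarification to BillyTheKid18756's answer (Luiz corrected it, but that was not obvious from the explanation of the code):
Do NOT use stringWithCString as the second step when removing accents. It can add unwanted characters at the end of your string, because the NSData is not NULL-terminated (as stringWithCString expects it to be).
Or use it, and add an extra NULL byte to your NSData, as Luiz did in his code.
I think a simpler answer is to replace: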
NSString *sanitizedText = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];
with:
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];
Taking BillyTheKid18756's code again, here is the complete, corrected version:
// The input text
NSString *text = @"BûvérÈ!@$&%^&(*^(_()-*/48";
// Defining what characters to accept
NSMutableCharacterSet *acceptedCharacters = [[NSMutableCharacterSet alloc] init];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
[acceptedCharacters addCharactersInString:@" _-.!"];
// Turn accented letters into normal letters (optional)
NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
// Corrected back-conversion from NSData to NSString
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];
// Removing unaccepted characters
NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];
Answer 4 (score: 7)
If you are comparing strings, use one of the following methods. Don't try to change the data.
- (NSComparisonResult)localizedCompare:(NSString *)aString
- (NSComparisonResult)localizedCaseInsensitiveCompare:(NSString *)aString
- (NSComparisonResult)compare:(NSString *)aString options:(NSStringCompareOptions)mask range:(NSRange)range locale:(id)locale
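You NEED to take the user's locale into account to handle strings correctly, particularly things like names. In most languages, characters such as ä and å have nothing in common beyond looking similar. They are inherently distinct characters whose meaning differs from that of other characters, and the actual rules and semantics differ from locale to locale.
The correct way to compare and sort strings is to take the user's locale into account. Anything else is naive, wrong, and very 1990s. Stop doing it.
If you are trying to pass data to a system that cannot handle non-ASCII text, well, that is simply the wrong thing to do. Pass it as a data blob.
Also normalize your strings first (see Peter Hosey's post), either precomposed or decomposed; basically, pick one normalization form and stick to it.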
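A rough sketch of comparing without modifying the data: the helper below (NamesMatch is a hypothetical name) wraps compare:options:range:locale: with the case-insensitive and diacritic-insensitive options. It takes the locale as a parameter; for user-facing comparisons you would typically pass [NSLocale currentLocale]. Note that this ignores case and accents only, not punctuation or spaces.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: compare two names while ignoring case and diacritics,
// without creating modified copies of either string.
static BOOL NamesMatch(NSString *a, NSString *b, NSLocale *locale) {
    NSStringCompareOptions options = NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch;
    // The range refers to the receiver (a); here the whole string is compared.
    NSComparisonResult result = [a compare:b
                                   options:options
                                     range:NSMakeRange(0, a.length)
                                    locale:locale];
    return result == NSOrderedSame;
}
```

For example, NamesMatch(@"José", @"jose", [NSLocale currentLocale]) asks whether the two names are equal under the user's own rules.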
- (NSString *)decomposedStringWithCanonicalMapping
- (NSString *)decomposedStringWithCompatibilityMapping
- (NSString *)precomposedStringWithCanonicalMapping
- (NSString *)precomposedStringWithCompatibilityMapping
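No, it is not nearly as simple and easy as we tend to think. Yes, it requires informed and careful decision making. (And a bit of experience with languages other than English helps.)
Answer 5 (score: 4)
Consider using the RegexKit framework. You could do something like: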
NSString *searchString = @"This is neat.";
NSString *regexString = @"[\\W]"; // the backslash must be escaped inside an Objective-C string literal
NSString *replaceWithString = @"";
NSString *replacedString = [searchString stringByReplacingOccurrencesOfRegex:regexString withString:replaceWithString];
NSLog (@"%@", replacedString);
//... Thisisneat
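Answer 6 (score: 4)
Consider using NSScanner, and specifically the methods -setCharactersToBeSkipped: (which accepts an NSCharacterSet) and -scanString:intoString: (which accepts a string and returns the scanned string by reference).
You may also want to couple this with -[NSString localizedCompare:], or perhaps -[NSString compare:options:] with the NSDiacriticInsensitiveSearch option. That could save you from having to remove or replace accents, so you can focus on removing punctuation, whitespace, and so on.
If you must use an approach like the one in your question, at least use an NSMutableString and replaceOccurrencesOfString:withString:options:range:, which will be much more efficient than creating lots of nearly identical autoreleased strings. It may be that just reducing the number of allocations boosts performance "enough" for the time being.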
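As a sketch of the NSMutableString suggestion, the helper below removes the same five characters as the question's code, but edits a single mutable copy in place. StripSeparators is a hypothetical name and the code assumes ARC; it has not been benchmarked against the original.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: remove separators in place on one mutable copy instead
// of allocating a new string for every replacement.
static NSString *StripSeparators(NSString *input) {
    NSMutableString *working = [input mutableCopy];
    for (NSString *token in @[@" ", @"'", @"`", @"-", @"_"]) {
        // The range is recomputed each pass because the string shrinks.
        [working replaceOccurrencesOfString:token
                                 withString:@""
                                    options:NSLiteralSearch
                                      range:NSMakeRange(0, working.length)];
    }
    return [working lowercaseString];
}
```

Accent removal is deliberately left out here; it could be handled by one of the folding or comparison approaches from the other answers.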
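Answer 7 (score: 4)
To give a complete example that combines the answers from Luiz and Peter, with a few added lines, you get the code below.
The code creates a set of accepted characters, turns accented letters into plain letters, and removes every character that is not in the set. In Objective-C: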
// The input text
NSString *text = @"BûvérÈ!@$&%^&(*^(_()-*/48";
// Create set of accepted characters
NSMutableCharacterSet *acceptedCharacters = [[NSMutableCharacterSet alloc] init];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
[acceptedCharacters addCharactersInString:@" _-.!"];
// Turn accented letters into normal letters (optional)
NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
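// initWithData: is used because the NSData bytes are not NULL-terminated, which stringWithCString: would require
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];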
// Remove characters not in the set
NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];
In Swift (2.x syntax):
// The input text
let text = "BûvérÈ!@$&%^&(*^(_()-*/48"
// Create set of accepted characters
let acceptedCharacters = NSMutableCharacterSet()
acceptedCharacters.formUnionWithCharacterSet(NSCharacterSet.letterCharacterSet())
acceptedCharacters.formUnionWithCharacterSet(NSCharacterSet.decimalDigitCharacterSet())
acceptedCharacters.addCharactersInString(" _-.!")
// Turn accented letters into normal letters (optional)
let sanitizedData = text.dataUsingEncoding(NSASCIIStringEncoding, allowLossyConversion: true)
let sanitizedText = String(data: sanitizedData!, encoding: NSASCIIStringEncoding)
// Remove characters not in the set
let components = sanitizedText!.componentsSeparatedByCharactersInSet(acceptedCharacters.invertedSet)
let output = components.joinWithSeparator("")
The output of both examples is: BuverE!_-48
Answer 8 (score: 3)
I just bumped into this, and maybe it is too late, but here is what worked for me:
// text is the input string, and this just removes accents from the letters
// lossy encoding turns accented letters into normal letters
// dataUsingEncoding: returns an immutable NSData, so take a mutable copy first
NSMutableData *sanitizedData = [[[text dataUsingEncoding:NSASCIIStringEncoding
allowLossyConversion:YES] mutableCopy] autorelease];
// increasing the length by 1 appends a 0 byte (increaseLengthBy:
// guarantees that the new space is filled with 0s), effectively turning
// sanitizedData into a C string
[sanitizedData increaseLengthBy:1];
// now we just create a string from the C string in sanitizedData
NSString *final = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];
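Answer 9 (score: 1)
These answers didn't work as expected for me. Specifically, decomposedStringWithCanonicalMapping didn't strip accents and umlauts the way I had expected.
Here is a variation on what I used that answers the brief: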
// replace accents, umlauts etc with equivalent letter i.e 'é' becomes 'e'.
// Always use en_GB (or a locale without the characters you wish to strip) as locale, no matter which language we're taking as input
NSString *processedString = [string stringByFoldingWithOptions: NSDiacriticInsensitiveSearch locale: [NSLocale localeWithLocaleIdentifier: @"en_GB"]];
// remove non-letters
processedString = [[processedString componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];
// trim whitespace
processedString = [processedString stringByTrimmingCharactersInSet: [NSCharacterSet whitespaceCharacterSet]];
return processedString;
Answer 10 (score: 0)
Peter's solution in Swift:
Example:
let newString = oldString.componentsSeparatedByCharactersInSet(NSCharacterSet.letterCharacterSet().invertedSet).joinWithSeparator("")
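Answer 11 (score: -1)
I wanted to filter out everything except letters and numbers, so I adapted Lorean's implementation of a category on NSString to work a little differently. In this example, you specify a string containing only the characters you want to keep, and everything else is filtered out: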
@interface NSString (PraxCategories)
+ (NSString *)lettersAndNumbers;
- (NSString*)stringByKeepingOnlyLettersAndNumbers;
- (NSString*)stringByKeepingOnlyCharactersInString:(NSString *)string;
@end
@implementation NSString (PraxCategories)
+ (NSString *)lettersAndNumbers { return @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; }
- (NSString*)stringByKeepingOnlyLettersAndNumbers {
return [self stringByKeepingOnlyCharactersInString:[NSString lettersAndNumbers]];
}
- (NSString*)stringByKeepingOnlyCharactersInString:(NSString *)string {
NSCharacterSet *characterSet = [NSCharacterSet characterSetWithCharactersInString:string];
NSMutableString * mutableString = @"".mutableCopy;
for (NSUInteger i = 0; i < [self length]; i++) {
// characterAtIndex: returns a 16-bit unichar; storing it in a char would truncate it
unichar character = [self characterAtIndex:i];
if ([characterSet characterIsMember:character]) [mutableString appendFormat:@"%C", character];
}
return mutableString.copy;
}
@end
Once you have made your categories, using them is trivial, and you can use them on any NSString:
NSString *string = someStringValueThatYouWantToFilter;
string = [string stringByKeepingOnlyLettersAndNumbers];
Or, for example, if you wanted to get rid of everything except vowels:
string = [string stringByKeepingOnlyCharactersInString:@"aeiouAEIOU"];
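If you are still learning Objective-C and aren't using categories, I encourage you to try them out. They are the best place to put things like this, because they add functionality to every object of the class you categorize.
Categories simplify and encapsulate the code you are adding, making it easy to reuse across all of your projects. It's a great feature of Objective-C!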