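NSString - Convert to pure alphabet only (i.e. remove accents + punctuation)

Asked: 2009-08-05 07:57:58

Tags: objective-c regex cocoa string nsstring

I'm trying to compare names without any punctuation, spaces, accents, etc. At the moment I am doing the following: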

-(NSString*) prepareString:(NSString*)a {
    //remove any accents and punctuation;
    a=[[[NSString alloc] initWithData:[a dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES] encoding:NSASCIIStringEncoding] autorelease];

    a=[a stringByReplacingOccurrencesOfString:@" " withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"'" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"`" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"-" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"_" withString:@""];
    a=[a lowercaseString];
    return a;
}

However, I need to do this for hundreds of strings, and I need to make it more efficient. Any ideas?

12 answers:

Answer 0 (score: 80)

NSString* finish = [[start componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];

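Answer 1 (score: 39)

Before using any of these solutions, don't forget to use decomposedStringWithCanonicalMapping to decompose any accented letters. This will turn, for example, é (U+00E9) into e followed by a combining acute accent (U+0065 U+0301). Then, when you strip out the non-alphanumeric characters, the unaccented letters will remain.

The reason this matters is that you probably don't want, say, "dän" and "dün"* to be treated as the same. If you strip out all accented letters, as some of these solutions may do, you end up with "dn" in both cases, so those strings will compare as equal.

So you should decompose them first, so that you can strip the accents and leave the letters.

*Example from German. Thanks to Joris Weimar for providing it.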
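A minimal Swift sketch of the decompose-then-strip idea, under some assumptions. The helper name baseLetters(of:) is made up for illustration. One thing worth knowing: Apple documents the letter character set as covering Unicode categories L* and M*, so combining marks count as "letters" and have to be excluded explicitly (here via nonBaseCharacters).

```swift
import Foundation

// Hypothetical helper: decompose, then keep only base letters.
func baseLetters(of text: String) -> String {
    // Canonical decomposition (NFD): "ä" becomes "a" followed by U+0308.
    let decomposed = text.decomposedStringWithCanonicalMapping
    var result = String.UnicodeScalarView()
    for scalar in decomposed.unicodeScalars {
        // CharacterSet.letters includes combining marks (category M*),
        // so they are filtered out separately with nonBaseCharacters.
        if CharacterSet.letters.contains(scalar)
            && !CharacterSet.nonBaseCharacters.contains(scalar) {
            result.append(scalar)
        }
    }
    return String(result)
}
```

With this, "dän" and "dün" should come out as "dan" and "dun" rather than both collapsing to "dn".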

Answer 2 (score: 14)

On a similar question, Ole Begemann suggests using stringByFoldingWithOptions:locale: and I believe this is the best solution here:

NSString *accentedString = @"ÁlgeBra";
NSString *unaccentedString = [accentedString stringByFoldingWithOptions:NSDiacriticInsensitiveSearch locale:[NSLocale currentLocale]];
     

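Depending on the nature of the strings you want to convert, you might want to set a fixed locale (e.g. English) instead of using the user's current locale. That way, you can be sure to get the same results on every machine.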
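For what it's worth, here is a rough Swift equivalent with a fixed locale. The "en_US" identifier and the function name foldedForComparison are my own choices for illustration, and adding .caseInsensitive is an extra that the Objective-C snippet above does not do.

```swift
import Foundation

// Fixed locale picked for illustration so results do not depend on the
// user's settings; choose whatever suits your data.
let fixedLocale = Locale(identifier: "en_US")

// Hypothetical helper: fold away diacritics and case in one call.
func foldedForComparison(_ text: String) -> String {
    return text.folding(options: [.diacriticInsensitive, .caseInsensitive],
                        locale: fixedLocale)
}
```

Note that folding only handles accents and case. Punctuation and whitespace still have to be removed separately.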

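Answer 3 (score: 7)

An important clarification on BillyTheKid18756's answer (it was corrected by Luiz, but that is not obvious from the explanation of the code):

DO NOT USE stringWithCString as the second step when removing accents. It can add unwanted characters at the end of your string, because the NSData is not NULL-terminated (which stringWithCString expects). Or use it and append an extra NULL byte to your NSData, as Luiz did in his code.

I think a simpler answer is to replace: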

NSString *sanitizedText = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];

with:

NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];

Taking BillyTheKid18756's code again, here is the complete corrected version:

// The input text
NSString *text = @"BûvérÈ!@$&%^&(*^(_()-*/48";

// Defining what characters to accept
NSMutableCharacterSet *acceptedCharacters = [[NSMutableCharacterSet alloc] init];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
[acceptedCharacters addCharactersInString:@" _-.!"];

// Turn accented letters into normal letters (optional)
NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
// Corrected back-conversion from NSData to NSString
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];

// Removing unaccepted characters
NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];

Answer 4 (score: 7)

If you are trying to compare strings, use one of these methods. Don't try to change the data.

- (NSComparisonResult)localizedCompare:(NSString *)aString
- (NSComparisonResult)localizedCaseInsensitiveCompare:(NSString *)aString
- (NSComparisonResult)compare:(NSString *)aString options:(NSStringCompareOptions)mask range:(NSRange)range locale:(id)locale

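You NEED to consider the user's locale to do things right with strings, particularly things like names. In most languages, characters like ä and å are not the same beyond looking similar. They are inherently distinct characters with their own meaning, but the actual rules and semantics are different for each locale.

The correct way to compare and sort strings is by considering the user's locale. Anything else is naive, wrong, and very 1990s. Stop doing it.

If you are trying to pass data to a system that cannot support non-ASCII, well, that is just the wrong thing to do. Pass it as a data blob.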

https://developer.apple.com/library/ios/documentation/cocoa/Conceptual/Strings/Articles/SearchingStrings.html

Also normalize your strings first (see Peter Hosey's post), either precomposing or decomposing. Basically, pick one normalized form.

- (NSString *)decomposedStringWithCanonicalMapping
- (NSString *)decomposedStringWithCompatibilityMapping
- (NSString *)precomposedStringWithCanonicalMapping
- (NSString *)precomposedStringWithCompatibilityMapping

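No, it's not nearly as simple and easy as we tend to think. Yes, it requires informed and careful decision making. (And a bit of non-English language experience helps.)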
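To make the compare-don't-mutate approach concrete, here is a small Swift sketch built on compare(_:options:range:locale:). The namesMatch wrapper and its default of the current locale are assumptions for illustration, not something from the answer.

```swift
import Foundation

// Hypothetical wrapper: true when two names are equal, ignoring case and
// diacritics according to the given locale's rules.
func namesMatch(_ a: String, _ b: String, locale: Locale = .current) -> Bool {
    let result = a.compare(b,
                           options: [.caseInsensitive, .diacriticInsensitive],
                           range: nil,
                           locale: locale)
    return result == .orderedSame
}
```

Neither string is modified, so the original data stays intact. Punctuation and spaces are still significant in this comparison.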

Answer 5 (score: 4)

Consider using the RegexKit framework. You could do something like:

NSString *searchString      = @"This is neat.";
NSString *regexString       = @"[\\W]"; // backslash must be escaped inside an Objective-C string literal
NSString *replaceWithString = @"";
NSString *replacedString    = [searchString stringByReplacingOccurrencesOfRegex:regexString withString:replaceWithString];

NSLog (@"%@", replacedString);
//... Thisisneat

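Answer 6 (score: 4)

Consider using NSScanner, and specifically the methods -setCharactersToBeSkipped: (which accepts an NSCharacterSet) and -scanString:intoString: (which accepts a string and returns the scanned string by reference).

You may also want to couple this with -[NSString localizedCompare:], or perhaps -[NSString compare:options:] with the NSDiacriticInsensitiveSearch option. That could save you from having to remove or replace accents, so you can focus on removing punctuation, whitespace, etc.

If you must use an approach like the one in your question, at least use an NSMutableString and replaceOccurrencesOfString:withString:options:range:. That will be much more efficient than creating tons of nearly identical autoreleased strings. It may be that just reducing the number of allocations boosts performance "enough" for the time being.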
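A sketch of the single-mutable-buffer idea in Swift, assuming the same five separators as in the question. The function name strippedName is hypothetical.

```swift
import Foundation

// Hypothetical helper: strip separators in place on one mutable buffer
// instead of allocating a new string per replacement.
func strippedName(_ name: String) -> String {
    let buffer = NSMutableString(string: name)
    for token in [" ", "'", "`", "-", "_"] {
        // Recompute the range on each pass because the buffer shrinks.
        let fullRange = NSRange(location: 0, length: buffer.length)
        _ = buffer.replaceOccurrences(of: token, with: "",
                                      options: .literal, range: fullRange)
    }
    return buffer.lowercased
}
```

Whether this is actually faster for a few hundred short names is something to measure rather than assume.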

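Answer 7 (score: 4)

To give a complete example combining the answers from Luiz and Peter, with a few added lines, you get the code below.

The code does the following:

  1. Creates a set of accepted characters
  2. Turns accented letters into normal letters
  3. Removes characters not in the set

    Objective-C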

    // The input text
    NSString *text = @"BûvérÈ!@$&%^&(*^(_()-*/48";
    
    // Create set of accepted characters
    NSMutableCharacterSet *acceptedCharacters = [[NSMutableCharacterSet alloc] init];
    [acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
    [acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
    [acceptedCharacters addCharactersInString:@" _-.!"];
    
    // Turn accented letters into normal letters (optional)
    NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
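    // initWithData: reads exactly [sanitizedData length] bytes; stringWithCString: would expect a NUL terminator that NSData does not provide
    NSString *sanitizedText = [[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding];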
    
    // Remove characters not in the set
    NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];
    

    Swift (2.2) example

    let text = "BûvérÈ!@$&%^&(*^(_()-*/48"
    
    // Create set of accepted characters
    let acceptedCharacters = NSMutableCharacterSet()
    acceptedCharacters.formUnionWithCharacterSet(NSCharacterSet.letterCharacterSet())
    acceptedCharacters.formUnionWithCharacterSet(NSCharacterSet.decimalDigitCharacterSet())
    acceptedCharacters.addCharactersInString(" _-.!")
    
    // Turn accented letters into normal letters (optional)
    let sanitizedData = text.dataUsingEncoding(NSASCIIStringEncoding, allowLossyConversion: true)
    let sanitizedText = String(data: sanitizedData!, encoding: NSASCIIStringEncoding)
    
    // Remove characters not in the set
    let components = sanitizedText!.componentsSeparatedByCharactersInSet(acceptedCharacters.invertedSet)
    let output = components.joinWithSeparator("")
    

    Output

    The output of both examples is: BuverE!_-48

Answer 8 (score: 3)

Just bumped into this. Maybe it's too late, but here is what worked for me:

// text is the input string, and this just removes accents from the letters

// lossy encoding turns accented letters into normal letters
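// dataUsingEncoding: returns an immutable NSData, so take a mutable copy before growing it
NSMutableData *sanitizedData = [[text dataUsingEncoding:NSASCIIStringEncoding
                                  allowLossyConversion:YES] mutableCopy];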

// increase length by 1 adds a 0 byte (increaseLengthBy 
// guarantees to fill the new space with 0s), effectively turning 
// sanitizedData into a c-string
[sanitizedData increaseLengthBy:1];

// now we just create a string with the c-string in sanitizedData
NSString *final = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];

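Answer 9 (score: 1)

These answers didn't work as expected for me. Specifically, decomposedStringWithCanonicalMapping didn't strip accents/umlauts as I'd expected.

Here's a variation on what I used that answers the brief: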

// replace accents, umlauts etc with equivalent letter i.e 'é' becomes 'e'.
// Always use en_GB (or a locale without the characters you wish to strip) as locale, no matter which language we're taking as input
NSString *processedString = [string stringByFoldingWithOptions: NSDiacriticInsensitiveSearch locale: [NSLocale localeWithLocaleIdentifier: @"en_GB"]];
// remove non-letters
processedString = [[processedString componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];
// trim whitespace
processedString = [processedString stringByTrimmingCharactersInSet: [NSCharacterSet whitespaceCharacterSet]];
return processedString;

Answer 10 (score: 0)

Peter's solution in Swift:

let newString = oldString.componentsSeparatedByCharactersInSet(NSCharacterSet.letterCharacterSet().invertedSet).joinWithSeparator("")

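Answer 11 (score: -1)

I wanted to filter out everything but letters and numbers, so I adapted Lorean's implementation of a category on NSString to work a little differently. In this example, you specify a string containing only the characters you want to keep, and everything else is filtered out: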

@interface NSString (PraxCategories)
+ (NSString *)lettersAndNumbers;
- (NSString*)stringByKeepingOnlyLettersAndNumbers;
- (NSString*)stringByKeepingOnlyCharactersInString:(NSString *)string;
@end


@implementation NSString (PraxCategories)

+ (NSString *)lettersAndNumbers { return @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; }

- (NSString*)stringByKeepingOnlyLettersAndNumbers {
    return [self stringByKeepingOnlyCharactersInString:[NSString lettersAndNumbers]];
}

- (NSString*)stringByKeepingOnlyCharactersInString:(NSString *)string {
    NSCharacterSet *characterSet = [NSCharacterSet characterSetWithCharactersInString:string];
    NSMutableString * mutableString = @"".mutableCopy;
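    for (NSUInteger i = 0; i < [self length]; i++){
        // characterAtIndex: returns a 16-bit unichar; storing it in a char would truncate non-ASCII characters
        unichar character = [self characterAtIndex:i];
        if([characterSet characterIsMember:character]) [mutableString appendFormat:@"%C", character];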
    }
    return mutableString.copy;
}

@end

Once you've made your categories, using them is trivial, and you can use them on ANY NSString:

NSString *string = someStringValueThatYouWantToFilter;

string = [string stringByKeepingOnlyLettersAndNumbers];

Or, for example, if you wanted to get rid of everything except vowels:

string = [string stringByKeepingOnlyCharactersInString:@"aeiouAEIOU"];

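If you're still learning Objective-C and aren't using categories, I encourage you to try them out. They're the best place to put things like this, because they add functionality to every object of the class you categorize.

Categories simplify and encapsulate the code you are adding, making it easy to reuse across all of your projects. It's a great feature of Objective-C!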
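For Swift readers, an extension plays roughly the same role as a category. The following is only a sketch of how the same filter might look. The method names mirror the category above but are otherwise my own, and it works on Swift Characters rather than UTF-16 units.

```swift
extension String {
    // Hypothetical counterpart of stringByKeepingOnlyCharactersInString:
    // keeps each Character that also appears in `allowed`.
    func keepingOnlyCharacters(in allowed: String) -> String {
        let allowedSet = Set(allowed)
        return String(self.filter { allowedSet.contains($0) })
    }

    // Hypothetical counterpart of stringByKeepingOnlyLettersAndNumbers,
    // limited to ASCII letters and digits like the category above.
    func keepingOnlyLettersAndNumbers() -> String {
        let allowed = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
        return keepingOnlyCharacters(in: allowed)
    }
}
```

Usage would be along the lines of name.keepingOnlyLettersAndNumbers() on any String value.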