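Remove everything except letters from a string

Time: 2012-03-27 18:02:11

Tags: c# .net regex string linq

I want an efficient way to remove every character from a given string except letters. Any suggestions?

5 answers:

Answer 0 (score: 9)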

var result = str.Where(c => char.IsLetter(c));

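I was really intrigued by @KirillPolishchuk's answer, so I ran a small benchmark in LINQPad with randomly built strings. Here is the complete code (I had to change my original code slightly because it returned an IEnumerable):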

void Main()
{
    TimeSpan elapsed;
    string result;

    elapsed = TheLINQWay(buildString(1000000), out result);
    Console.WriteLine("LINQ way: {0}", elapsed);

    elapsed = TheRegExWay(buildString(1000000), out result);
    Console.WriteLine("RegEx way: {0}", elapsed);
}

TimeSpan TheRegExWay(string s, out string result)
{
    Stopwatch stopw = new Stopwatch();

    stopw.Start();
    result = Regex.Replace(s, @"\P{L}", string.Empty);
    stopw.Stop();

    return stopw.Elapsed;
}

TimeSpan TheLINQWay(string s, out string result)
{
    Stopwatch stopw = new Stopwatch();

    stopw.Start();
    result = new string(s.Where(c => char.IsLetter(c)).ToArray());
    stopw.Stop();

    return stopw.Elapsed;
}

string buildString(int len)
{
    byte[] buffer = new byte[len];
    Random r = new Random((int)DateTime.Now.Ticks);

    for(int i = 0; i < len; i++)
        buffer[i] = (byte)r.Next(256);

    return Encoding.ASCII.GetString(buffer);
}

Here are the results:

LINQ way: 00:00:00.0150030
RegEx way: 00:00:00.2788130

One thing still needs to be said: as Servy pointed out in the comments, with shorter strings the regex approach takes less time.
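The benchmark above times each approach once, on a different random string each time. The following is a rough sketch of a slightly fairer measurement, assuming a fixed seed, a shared input, and one untimed warm-up call per approach. The class and method names (`LetterFilterBenchmark`, `Measure`, `BuildString`) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

public static class LetterFilterBenchmark
{
    // Keeps letters by filtering characters with LINQ.
    public static string LinqWay(string s) =>
        new string(s.Where(c => char.IsLetter(c)).ToArray());

    // Keeps letters by removing everything outside Unicode category L.
    public static string RegexWay(string s) =>
        Regex.Replace(s, @"\P{L}", string.Empty);

    // Builds a random string of printable ASCII characters (codes 32..126).
    // The fixed seed is an assumption that makes runs repeatable.
    public static string BuildString(int length, int seed)
    {
        var random = new Random(seed);
        var chars = new char[length];
        for (int i = 0; i < length; i++)
            chars[i] = (char)random.Next(32, 127);
        return new string(chars);
    }

    // Calls the filter once untimed (warm-up), then times the requested number of runs.
    public static TimeSpan Measure(Func<string, string> filter, string input, int runs)
    {
        filter(input);
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
            filter(input);
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

A possible use is to build one input with `LetterFilterBenchmark.BuildString(1000000, 42)` and pass it to `Measure` once with `LinqWay` and once with `RegexWay`, repeating for shorter inputs to see how the gap changes with string length.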

Answer 1 (score: 6)

Use:

var result = Regex.Replace(input, @"\P{L}", string.Empty);
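To make the effect of that pattern concrete, here is a small sketch. `\P{L}` matches any character that is not in the Unicode letter category, so accented letters survive. If only the ASCII letters A-Z and a-z should be kept, a negated character class is one alternative. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class RegexLetterFilter
{
    // Removes every character that is not in Unicode general category L (letters).
    public static string KeepUnicodeLetters(string input) =>
        Regex.Replace(input, @"\P{L}", string.Empty);

    // Removes every character outside the ASCII ranges A-Z and a-z.
    public static string KeepAsciiLetters(string input) =>
        Regex.Replace(input, "[^A-Za-z]", string.Empty);
}
```

For the input `"Cr\u00E8me br\u00FBl\u00E9e 42!"`, `KeepUnicodeLetters` keeps the accented letters and drops only the spaces, digits, and punctuation, while `KeepAsciiLetters` returns `"Crmebrle"`.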

Answer 2 (score: 2)

The most efficient way I can think of:

string input = "ABCD 13 ~";

// at worst, all characters are alphabetical, so we have to accommodate for that
char[] output = new char[input.Length];

int numberOfAlphabeticals = 0;
for (int i = 0; i < input.Length; i++)
{
    char character = input[i];

    // ASCII letters only; comparing the char directly avoids truncating
    // non-ASCII characters into the letter ranges, as a cast to byte would
    if ((character >= 'A' && character <= 'Z') || (character >= 'a' && character <= 'z'))
    {
        output[numberOfAlphabeticals] = character;
        ++numberOfAlphabeticals;
    }
}

string outputAsString = new string(output, 0, numberOfAlphabeticals);

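Answer 3 (score: 1)

I think the fastest way (performance-wise) is to create a character array covering ASCII codes up to 122 ('z'), convert the chosen string to a byte array, and use a StringBuilder to build another string with the unwanted characters removed: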

private static char[] alphabet = {'\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '\0', '\0', '\0', '\0', '\0', '\0', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',};

Here is the removal function (I haven't compiled it, but it should give you the idea):

string RemoveNonAlpha(string value)
{
    byte[] asciiBytes = Encoding.ASCII.GetBytes(value);
    StringBuilder sb = new StringBuilder();
    for(int i = 0; i < asciiBytes.Length; i++)
    {
        if((asciiBytes[i] >= 65 && asciiBytes[i] <= 90) || (asciiBytes[i] >= 97 && asciiBytes[i] <= 122))
        {
            sb.Append(alphabet[asciiBytes[i]]);
        }
    }

    return sb.ToString();
}
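The function above still tests the letter ranges explicitly, so the array only maps each letter code back to itself. As a sketch of how a lookup table could carry the test on its own, the version below uses a `bool[128]` table instead. This is an illustrative variant rather than the answer's code, and `AsciiLetterTable` and its members are hypothetical names.

```csharp
public static class AsciiLetterTable
{
    // IsLetter[code] is true for 'A'..'Z' and 'a'..'z'; every other ASCII code stays false.
    private static readonly bool[] IsLetter = BuildTable();

    private static bool[] BuildTable()
    {
        var table = new bool[128];
        for (char c = 'A'; c <= 'Z'; c++) table[c] = true;
        for (char c = 'a'; c <= 'z'; c++) table[c] = true;
        return table;
    }

    public static string RemoveNonAlpha(string value)
    {
        // At most every character is a letter, so value.Length is a safe upper bound.
        var output = new char[value.Length];
        int count = 0;
        foreach (char c in value)
        {
            // Characters beyond the table (non-ASCII) are dropped without indexing.
            if (c < IsLetter.Length && IsLetter[c])
                output[count++] = c;
        }
        return new string(output, 0, count);
    }
}
```

Calling `AsciiLetterTable.RemoveNonAlpha("ABCD 13 ~")` returns `"ABCD"`. Whether the table lookup beats the two range comparisons would need measuring; it mainly keeps the membership rule in one place.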

Update

Based on Nikola's answer, here is an improved version:

private static string RemoveNonAlpha(string value)
{
    char[] output = new char[value.Length];
    int numAlpha = 0;
    for (int i = 0; i < value.Length; i++)
    {
        // Compare the char itself: casting to byte would truncate non-ASCII
        // characters (e.g. U+0141 becomes 0x41, 'A') and keep them by mistake.
        char c = value[i];
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        {
            output[numAlpha] = value[i];
            numAlpha++;
        }
    }

    return new string(output, 0, numAlpha);
}

Here are the results compared with LINQ:

The LINQ way 100: 6.7935
The fast way 100: 0.4648
The LINQ way 1000: 0.0442
The fast way 1000: 0.0134
The LINQ way 10000: 0.2078
The fast way 10000: 0.143
The LINQ way 100000: 2.0617
The fast way 100000: 1.3864

Answer 4 (score: 0)

Use

^\W

as the input to the regex Replace method:

http://msdn.microsoft.com/en-us/library/xwewhkd1.aspx
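For reference, a brief sketch of how `\W` behaves with `Regex.Replace`, assuming that is the method the link refers to. In .NET, `\W` matches any non-word character, and word characters include digits and the underscore as well as letters, so those are not removed. With a leading `^`, the pattern can only match at the start of the input. The helper names are invented for this example.

```csharp
using System.Text.RegularExpressions;

public static class WordCharDemo
{
    // Removes every non-word character anywhere in the input.
    public static string RemoveNonWord(string input) =>
        Regex.Replace(input, @"\W", string.Empty);

    // Removes a single non-word character only if it is the first character.
    public static string RemoveLeadingNonWord(string input) =>
        Regex.Replace(input, @"^\W", string.Empty);
}
```

`RemoveNonWord("ab_1 c!")` returns `"ab_1c"` (the digit and underscore remain), and `RemoveLeadingNonWord("!ab c!")` returns `"ab c!"`. To keep letters only, a letter-specific pattern such as `\P{L}` from the earlier answer is a closer fit.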