A faster way to perform multiple string replacements

Time: 2010-11-11 14:28:26

Tags: c# regex

I need to do the following:

    static string[] pats = { "å", "Å", "æ", "Æ", "ä", "Ä", "ö", "Ö", "ø", "Ø" ,"è", "È", "à", "À", "ì", "Ì", "õ", "Õ", "ï", "Ï" };
    static string[] repl = { "a", "A", "a", "A", "a", "A", "o", "O", "o", "O", "e", "E", "a", "A", "i", "I", "o", "O", "i", "I" };
    static int i = pats.Length;
    int j;

     // function for the replacement(s)
     public string DoRepl(string Inp) {
      string tmp = Inp;
        for( j = 0; j < i; j++ ) {
            tmp = Regex.Replace(tmp,pats[j],repl[j]);
        }
        return tmp.ToString();            
    }
    /* Main flow processes about 45000 lines of input */

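Each line has 6 elements that go through DoRepl. That is roughly 300,000 function calls, each doing 20 Regex.Replace calls, for a total of about 6 million replacements.

Is there a more elegant way to do this in fewer passes?

8 answers:

Answer 0: (score: 21)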

static Dictionary<char, char> repl = new Dictionary<char, char>() { { 'å', 'a' }, { 'ø', 'o' } }; // etc...
public string DoRepl(string Inp)
{
    var tmp = Inp.Select(c =>
    {
        char r;
        if (repl.TryGetValue(c, out r))
            return r;
        return c;
    });
    return new string(tmp.ToArray());
}

Each character is checked against the dictionary only once, and is replaced if it is found there.
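The dictionary above is abbreviated with "etc...". As a minimal sketch of one way to fill it, the version below builds the map once from two parallel strings holding the question's 20 pairs. The class name `CharMapExample` and the constants `From` and `To` are assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CharMapExample
{
    // The pattern/replacement pairs from the question, as chars instead of strings.
    private const string From = "åÅæÆäÄöÖøØèÈàÀìÌõÕïÏ";
    private const string To   = "aAaAaAoOoOeEaAiIoOiI";

    // Built once: maps each accented char to its plain replacement.
    private static readonly Dictionary<char, char> Repl =
        From.Zip(To, (f, t) => new { f, t }).ToDictionary(p => p.f, p => p.t);

    public static string DoRepl(string input)
    {
        // One dictionary lookup per character; unmapped characters pass through unchanged.
        return new string(input.Select(c => Repl.TryGetValue(c, out char r) ? r : c).ToArray());
    }
}
```

This assumes the source file stores the accented characters in precomposed form, so that both strings have the same length.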

Answer 1: (score: 12)

How about this "trick"?

string conv = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(input));

Answer 2: (score: 10)

It may be faster without regular expressions.

    for( j = 0; j < i; j++ ) 
    {
        tmp = tmp.Replace(pats[j], repl[j]);
    }

Edit

Another way, using Zip and a StringBuilder:

StringBuilder result = new StringBuilder(input);
foreach (var zipped in patterns.Zip(replacements, (p, r) => new {p, r}))
{
  result = result.Replace(zipped.p, zipped.r);
}
return result.ToString();

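Answer 3: (score: 3)

First, I would use a StringBuilder to perform the translation inside a buffer, which avoids creating new strings all over the place.

Next, ideally we would like something akin to XPath's translate(), so that we can work with strings instead of arrays or maps. Let's do that in an extension method:
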
public static StringBuilder Translate(this StringBuilder builder,
    string inChars, string outChars)
{
    int length = Math.Min(inChars.Length, outChars.Length);
    for (int i = 0; i < length; ++i) {
        builder.Replace(inChars[i], outChars[i]);
    }
    return builder;
}

Then use it:

StringBuilder builder = new StringBuilder(yourString);
yourString = builder.Translate("åÅæÆäÄöÖøØèÈàÀìÌõÕïÏ",
    "aAaAaAoOoOeEaAiIoOiI").ToString();

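Answer 4: (score: 2)

The problem with your original regex is that you are not using it to its fullest potential. Remember that a regex pattern can use alternation. You will still need a dictionary, but you can do the whole job in one pass without looping over each character yourself.

This would be achieved as follows: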

string[] pats = { "å", "Å", "æ", "Æ", "ä", "Ä", "ö", "Ö", "ø", "Ø" ,"è", "È", "à", "À", "ì", "Ì", "õ", "Õ", "ï", "Ï" };
string[] repl = { "a", "A", "a", "A", "a", "A", "o", "O", "o", "O", "e", "E", "a", "A", "i", "I", "o", "O", "i", "I" };
// using Zip as a shortcut, otherwise setup dictionary differently as others have shown
var dict = pats.Zip(repl, (k,v) => new { Key = k, Value = v }).ToDictionary(o => o.Key, o => o.Value);

string input = "åÅæÆäÄöÖøØèÈàÀìÌõÕïÏ";
string pattern = String.Join("|", dict.Keys.Select(k => k)); // use ToArray() for .NET 3.5
string result = Regex.Replace(input, pattern, m => dict[m.Value]);

Console.WriteLine("Pattern: " + pattern);
Console.WriteLine("Input: " + input);
Console.WriteLine("Result: " + result);

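Of course, you should normally escape your patterns with Regex.Escape. In this case it is not needed, since we know the finite set of characters and none of them need escaping.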
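For keys that may contain regex metacharacters, escaping matters. The following is a sketch only: the class name `AlternationReplacer` and the extra keys `"a.b"` and `"+"` are hypothetical, added to show what Regex.Escape changes. It also builds the Regex once and reuses it, which seems sensible when the same replacement runs hundreds of thousands of times.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationReplacer
{
    // Hypothetical map for illustration; "a.b" and "+" contain regex metacharacters.
    private static readonly Dictionary<string, string> Map = new Dictionary<string, string>
    {
        { "å", "a" }, { "ø", "o" }, { "a.b", "X" }, { "+", " plus " }
    };

    // Escape each key so it is matched literally, then join the keys with alternation.
    private static readonly Regex Pattern = new Regex(
        string.Join("|", Map.Keys.Select(k => Regex.Escape(k)).ToArray()),
        RegexOptions.Compiled);

    public static string Replace(string input)
    {
        // Single pass: every match is looked up in the map.
        return Pattern.Replace(input, m => Map[m.Value]);
    }
}
```

Without the escaping, the unescaped `+` would make the pattern invalid, and `a.b` would also match text such as `axb`. Note that if one key is a prefix of another, the order of the alternatives decides which one wins, so such maps would need the longer keys listed first.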

Answer 5: (score: 1)

If you want to remove accents, then this solution may help: How do I remove diacritics (accents) from a string in .NET?

Otherwise I would do it in a single pass:

Dictionary<char, char> replacements = new Dictionary<char, char>();
...
StringBuilder result = new StringBuilder();
foreach(char c in str)
{
  char rc;
  if (!replacements.TryGetValue(c, out rc))
  {
    rc = c;
  }
  result.Append(rc);
}
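For the accent-removal route mentioned at the start of this answer, a commonly used approach is Unicode decomposition followed by dropping the combining marks. The sketch below is an assumption about what that linked solution looks like, not a copy of it; the class and method names are made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticsExample
{
    public static string RemoveDiacritics(string text)
    {
        // FormD splits e.g. 'å' into 'a' followed by a combining ring above (U+030A).
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Drop the combining marks, keep everything else.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

One caveat for the question's character list: 'æ' and 'ø' (and their uppercase forms) have no Unicode decomposition, so this approach leaves them untouched and they would still need an explicit mapping.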

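Answer 6: (score: 1)

In the special case of one-to-one character replacement, the fastest way (IMHO), even compared with a dictionary, would be a full character map: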

public class Converter
{
    private readonly char[] _map;

    public Converter()
    {
        // This code assumes char to be a short unsigned integer
        _map = new char[char.MaxValue + 1]; // one slot per char value, 0 through 65535

        for (int i = 0; i < _map.Length; i++)
            _map[i] = (char)i;

        _map['å'] = 'a';  // Note that 'å' is used as an integer index into the array.
        _map['Å'] = 'A';
        _map['æ'] = 'a';
        // ... the rest of overriding map
    }

    public string Convert(string source)
    {
        if (string.IsNullOrEmpty(source))
            return source;

        var result = new char[source.Length];

        for (int i = 0; i < source.Length; i++)
            result[i] = _map[source[i]]; // convert using the map

        return new string(result);
    }
}

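To speed this code up further, you may want to use the "unsafe" keyword and pointers. That way the string can be traversed faster and without bounds checks (in theory the VM will optimize them away, but it may not).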

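Answer 7: (score: 0)

I'm not familiar with the Regex class, but most regex engines have a transliteration operation that would work well here. Then you would only need one call per line.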
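As far as I know, the .NET Regex class has no transliteration operator comparable to Perl's tr///, so the one-call-per-line idea would need a small helper. The sketch below is a hypothetical `Translate` function (the name and signature are assumptions) that maps characters by position in a single pass over the input.

```csharp
using System;

public static class Transliterator
{
    // Hypothetical helper: replaces each char found in 'from'
    // with the char at the same index in 'to'.
    public static string Translate(string input, string from, string to)
    {
        if (from.Length != to.Length)
            throw new ArgumentException("from and to must have the same length");

        char[] buffer = input.ToCharArray();
        for (int i = 0; i < buffer.Length; i++)
        {
            // Ordinal search in the short 'from' string; -1 means "leave as is".
            int idx = from.IndexOf(buffer[i]);
            if (idx >= 0)
                buffer[i] = to[idx];
        }
        return new string(buffer);
    }
}
```

With only 20 characters to map, the linear IndexOf search is probably cheap enough; for larger sets a dictionary or a full lookup table, as in the other answers, would likely scale better.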