Has anyone implemented a Regex and/or XML parser around StringBuilders or Streams?

Asked: 2012-07-18 01:32:14

Tags: c# regex stringbuilder

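I'm building a stress-testing client that hammers servers and analyzes the responses, using as many threads as the client machine can muster. I constantly find myself throttled by garbage collection (and/or the lack of it), and in most cases it comes down to strings that I instantiate only to pass them to a Regex or an XML parsing routine.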

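If you decompile the Regex class, you'll see that internally it uses StringBuilders for nearly everything, but you can't pass it a StringBuilder; it helpfully dives down into private methods before it starts using them, so extension methods won't solve it either. You're in a similar situation if you want to get an object graph out of the parser in System.Xml.Linq.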
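To make the allocation pattern concrete, here is a minimal sketch of the round trip described above. The `<Payload>` element name and the `RegexRoundTripSketch` class are hypothetical, chosen only for illustration; the point is that `Regex.Replace` accepts and returns `string`, so text held in a `StringBuilder` has to be copied out and back in.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class RegexRoundTripSketch
{
    // Hypothetical pattern: blank out the body of a <Payload> element.
    private static readonly Regex PayloadBody = new Regex(
        "<Payload>.*?</Payload>",
        RegexOptions.Compiled | RegexOptions.Singleline);

    public static StringBuilder BlankPayload(StringBuilder sb)
    {
        // Regex has no StringBuilder overloads, so the text is copied into a string first.
        string snapshot = sb.ToString();

        // When anything matches, Replace builds a second full-size string.
        string replaced = PayloadBody.Replace(snapshot, "<Payload></Payload>");

        // Copy the result back into the builder.
        sb.Length = 0;
        sb.Append(replaced);
        return sb;
    }
}
```

At roughly 10 KB per document, each call can create two short-lived strings of about that size on top of the builder itself, which is the kind of garbage the question is trying to avoid.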

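This is not a case of pedantic premature optimization. I've looked at the "Regex replacements inside a StringBuilder" question and others. I've also profiled my app to see where the ceilings come from, and Regex.Replace() really does introduce significant overhead in a method chain where I'm trying to hit a server with millions of requests per hour and examine the XML responses for errors and embedded diagnostic codes. I've already gotten rid of just about every other inefficiency that was throttling throughput, and I've even cut out a lot of the Regex overhead by extending StringBuilder to do wildcard find/replace when I don't need capture groups or backreferences. But it seems to me that someone would have wrapped up a custom StringBuilder-based (or better yet, Stream-based) Regex and XML parsing utility by now.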

OK, rant over, but am I going to have to do this myself?

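UPDATE: I found a workaround that dropped peak memory consumption from several gigabytes to a few hundred megabytes, so I'm posting it below. I'm not adding it as an answer because a) I generally don't like doing that, and b) I still want to find out whether someone has taken the time to customize StringBuilder to do Regexes (or vice versa).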

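In my case, I could not use XmlReader because the stream I'm ingesting contains invalid binary content in certain elements. To parse the XML, I have to empty out those elements first. I was previously using a single static compiled Regex instance to do the replacement, and it consumed memory like mad (I'm trying to process ~300 10 KB docs/sec). The changes that drastically reduced consumption were: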

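  1. I added the code from the StringBuilder Extensions article on CodeProject for its handy IndexOf method.
  2. I added a (very) crude WildcardReplace method that allows one wildcard character (* or ?) per invocation.
  3. I replaced the Regex usage with a WildcardReplace() call to empty the contents of the offending elements.

This is very unpretty and tested only as far as my own purposes required. I would have made it more elegant and powerful, but YAGNI and all that, and I'm in a hurry. Here's the code: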

    /// <summary>
    /// Performs basic wildcard find and replace on a string builder, observing one of two 
    /// wildcard characters: * matches any number of characters, or ? matches a single character.
    /// Operates on only one wildcard per invocation; 2 or more wildcards in <paramref name="find"/>
    /// will cause an exception.
    /// All characters in <paramref name="replaceWith"/> are treated as literal parts of 
    /// the replacement text.
    /// </summary>
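    /// <param name="sb">The builder to modify in place.</param>
    /// <param name="find">Pattern with literal text on both sides of exactly one wildcard.</param>
    /// <param name="replaceWith">Literal replacement text.</param>
    /// <returns>The same <see cref="StringBuilder"/> instance, to allow chaining.</returns>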
    public static StringBuilder WildcardReplace(this StringBuilder sb, string find, string replaceWith) {
        if (find.Split(new char[] { '*' }).Length > 2 || find.Split(new char[] { '?' }).Length > 2 || (find.Contains("*") && find.Contains("?"))) {
            throw new ArgumentException("Only one wildcard is supported, but more than one was supplied.", "find");
        } 
        // are we matching one character, or any number?
        bool matchOneCharacter = find.Contains("?");
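        string[] parts = find.Split(new char[] { matchOneCharacter ? '?' : '*' }, StringSplitOptions.RemoveEmptyEntries);
        // The loop below reads parts[0] and parts[1], so the pattern needs literal text on both sides of the wildcard.
        if (parts.Length != 2) {
            throw new ArgumentException("The pattern must have literal text on both sides of exactly one wildcard.", "find");
        }
        int startItemIdx; 
        int endItemIdx;
        int newStartIdx = 0;
        int length;
        // IndexOf is the StringBuilder extension from the CodeProject article; it is assumed to
        // return -1 when nothing is found, so index 0 has to count as a valid match.
        while ((startItemIdx = sb.IndexOf(parts[0], newStartIdx)) >= 0 
            && (endItemIdx = sb.IndexOf(parts[1], startItemIdx + parts[0].Length)) >= 0) {
            length = (endItemIdx + parts[1].Length) - startItemIdx;
            // With "?" wildcard, find parameter length should equal the length of its match:
            if (matchOneCharacter && length != find.Length) {
                // Not a single-character match here; keep searching from the next position.
                newStartIdx = startItemIdx + 1;
                continue;
            }
            sb.Remove(startItemIdx, length);
            sb.Insert(startItemIdx, replaceWith);
            // Resume after the inserted text so the replacement itself is never rescanned.
            newStartIdx = startItemIdx + replaceWith.Length;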
        }
        return sb;
    }
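WildcardReplace depends on an `IndexOf(string, int)` extension for StringBuilder that is not reproduced in the post. The following is a minimal sketch of what such an extension could look like, assuming ordinal comparison and a return value of -1 when nothing is found. It is a hypothetical stand-in (the class name `StringBuilderSearchSketch` is made up), not the CodeProject code.

```csharp
using System;
using System.Text;

public static class StringBuilderSearchSketch
{
    // Hypothetical stand-in for the CodeProject extension: returns the index of the
    // first ordinal occurrence of value at or after startIndex, or -1 if there is none.
    public static int IndexOf(this StringBuilder sb, string value, int startIndex)
    {
        if (sb == null) throw new ArgumentNullException("sb");
        if (string.IsNullOrEmpty(value)) throw new ArgumentException("value must be non-empty.", "value");
        if (startIndex < 0) startIndex = 0;

        // Naive scan: try every candidate start position that leaves room for value.
        for (int i = startIndex; i <= sb.Length - value.Length; i++)
        {
            int j = 0;
            while (j < value.Length && sb[i + j] == value[j]) j++;
            if (j == value.Length) return i;
        }
        return -1;
    }
}
```

With an extension like this in scope, a call such as `sb.WildcardReplace("<Payload>*</Payload>", "<Payload></Payload>")` (element name hypothetical) would blank the element in place without building an intermediate string. Indexing into a very large StringBuilder can be slow on newer runtimes, so it is worth profiling this against the Regex version before relying on it.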
    

3 Answers:

Answer 0 (score: 1)

XmlReader is a stream-based XML parser. See http://msdn.microsoft.com/en-us/library/756wd7zs.aspx
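As a rough illustration of the forward-only reading this answer refers to, here is a minimal sketch that counts elements by name directly from a Stream, without building an object graph. The `XmlReaderSketch` class and the `error` element name used below are assumptions for illustration only.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlReaderSketch
{
    // Counts elements with the given local name while reading forward-only from the stream.
    public static int CountElements(Stream input, string elementName)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true, IgnoreComments = true };
        int count = 0;
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == elementName)
                    count++;
            }
        }
        return count;
    }
}
```

For example, `XmlReaderSketch.CountElements(responseStream, "error")` would count `<error>` elements in a response. As the update in the question points out, this only helps when the payload is well-formed XML; a reader like this is expected to throw an XmlException when it reaches invalid content.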

Answer 1 (score: 1)

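The Mono project has switched the license for their core libraries to an MIT X11 license. If you need to create a custom Regex library tuned for performance in your specific application, you should be able to start from the latest code in Mono's implementation of the System library.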

Answer 2 (score: 0)

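Here, try this. Everything is character-based and relatively low-level, for efficiency. Any number of *s or ?s can be used. However, your * is now ✪ and your ? is now ★. It took around three days to make it as clean as possible. You can even scan for multiple queries in a single sweep!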

Usage example: wildcard(new StringBuilder("Hello and welcome"), "hello✪w★l", "be") results in "become".

////////////////////////////////////////////////////////////////////////////////////////////////////////
///////////// Search for a string/s inside 'text' using the 'find' parameter, and replace with a string/s using the replace parameter
// ✪ represents multiple wildcard characters (non-greedy)
// ★ represents a single wildcard character
public StringBuilder wildcard(StringBuilder text, string find, string replace, bool caseSensitive = false)
{
    return wildcard(text, new string[] { find }, new string[] { replace }, caseSensitive);
}
public StringBuilder wildcard(StringBuilder text, string[] find, string[] replace, bool caseSensitive = false)
{
    if (text.Length == 0) return text;          // Degenerate case

    StringBuilder sb = new StringBuilder();     // The new adjusted string with replacements
    for (int i = 0; i < text.Length; i++)   {   // Go through every letter of the original large text

        bool foundMatch = false;                // Assume match hasn't been found to begin with
        for(int q=0; q< find.Length; q++) {     // Go through each query in turn
            if (find[q].Length == 0) continue;  // Ignore empty queries

            int f = 0;  int g = 0;              // Query cursor and text cursor
            bool multiWild = false;             // multiWild is ✪ symbol which represents many wildcard characters
            int multiWildPosition = 0;          

            while(true) {                       // Loop through query characters
                if (f >= find[q].Length || (i + g) >= text.Length) break;       // Bounds checking
                char cf = find[q][f];                                           // Character in the query (f is the offset)
                char cg = text[i + g];                                          // Character in the text (g is the offset)
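                if (!caseSensitive) { cf = char.ToLowerInvariant(cf); cg = char.ToLowerInvariant(cg); }   // Fold both sides so queries containing uppercase letters can still match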
                if (cf != '★' && cf != '✪' && cg != cf && !multiWild) break;        // Break search, and thus no match is found
                if (cf == '✪') { multiWild = true; multiWildPosition = f; f++; continue; }              // Multi-char wildcard activated. Move query cursor, and reloop
                if (multiWild && cg != cf && cf != '★') { f = multiWildPosition + 1; g++; continue; }   // Match since MultiWild has failed, so return query cursor to MultiWild position
                f++; g++;                                                           // Reaching here means that a single character was matched, so move both query and text cursor along one
            }

            if (f == find[q].Length) {          // If true, query cursor has reached the end of the query, so a match has been found!!!
                sb.Append(replace[q]);          // Append replacement
                foundMatch = true;
                if (find[q][f - 1] == '✪') { i = text.Length; break; }      // If the MultiWild is the last char in the query, then the rest of the string is a match, and so close off
                i += g - 1;                                                 // Move text cursor along by the amount equivalent to its found match
                break;                                                      // Stop after the first matching query so only one replacement is appended at this position
            }
        }
        if (!foundMatch) sb.Append(text[i]);    // If a match wasn't found at that point in the text, then just append the original character
    }
    return sb;
}