Text Parsing - My Parser Skipping Commands

Time: 2010-05-25 19:06:23

Tags: c# text-parsing

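I am trying to parse a text format. I want to mark inline code with backticks (`), the way SO does. The rule is that if you want to use a backtick inside an inline code element, you should wrap the inline code in double backticks.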

Like this:

``Mark inline code with backticks (`)``

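For some reason, my parser seems to skip over the double backticks completely. Here is the code of the function that does the inline code parsing: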

    private string ParseInlineCode(string input)
    {
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] == '`' && input[i - 1] != '\\')
            {
                if (input[i + 1] == '`')
                {
                    string str = ReadToCharacter('`', i + 2, input);
                    while (input[i + str.Length + 2] != '`')
                    {
                        str += ReadToCharacter('`', i + str.Length + 3, input);
                    }
                    string tbr = "``" + str + "``";
                    str = str.Replace("&", "&amp;");
                    str = str.Replace("<", "&lt;");
                    str = str.Replace(">", "&gt;");
                    input = input.Replace(tbr, "<code>" + str + "</code>");
                    i += str.Length + 13;
                }
                else
                {
                    string str = ReadToCharacter('`', i + 1, input);
                    input = input.Replace("`" + str + "`", "<code>" + str + "</code>");
                    i += str.Length + 13;
                }
            }
        }
        return input;
    }

If I put single backticks around something, it is wrapped in <code> tags correctly.

2 Answers:

Answer 0: (score: 4)

In the while loop

while (input[i + str.Length + 2] != '`')
{
    str += ReadToCharacter('`', i + str.Length + 3, input);
}

you are looking at the wrong index, i + str.Length + 2 instead of i + str.Length + 3, and in turn you have to add the backtick to the body. It should be

while (input[i + str.Length + 3] != '`')
{
    str += '`' + ReadToCharacter('`', i + str.Length + 3, input);
}

But there are some more bugs in the code. If the first character of the input is a backtick, the following line will cause an IndexOutOfRangeException

 if (input[i] == '`' && input[i - 1] != '\\')

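If the input contains an odd number of separated backticks and the last character of the input is a backtick, the following line will cause an IndexOutOfRangeException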

if (input[i + 1] == '`')
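
The two out-of-range reads above can be avoided by checking the index before touching the neighbouring character. The following is a minimal sketch, not the original poster's code: the class name `BacktickChecks` and both method names are made up for illustration, and it assumes `i` is already a valid index into `input`.

```csharp
using System;

// Hypothetical helpers (names are assumptions, not from the original code).
public static class BacktickChecks
{
    // True when input[i] is a backtick that is not preceded by a backslash.
    // The i == 0 test avoids reading input[-1] when the text starts with a backtick.
    public static bool IsUnescapedBacktick(string input, int i)
    {
        return input[i] == '`' && (i == 0 || input[i - 1] != '\\');
    }

    // True when another backtick directly follows position i.
    // The length test avoids reading past the end when the text ends with a backtick.
    public static bool IsDoubleBacktick(string input, int i)
    {
        return i + 1 < input.Length && input[i + 1] == '`';
    }
}
```

With these guards, an input consisting of a single backtick no longer throws at either check.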

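You should refactor the code into smaller methods instead of handling many cases in a single method - that is very error-prone. If you have not written unit tests for the code, I strongly suggest you do so. And because parsers are not easy to test, given all the kinds of invalid input you have to be prepared for, have a look at PEX - a tool that automatically generates test cases for your code by analyzing all branching points and trying to take every possible code path.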
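
One way the "smaller methods" advice could look is sketched below. This is an illustrative rewrite under stated assumptions, not the original poster's or the answerer's code: all names are hypothetical, a delimiter is a run of exactly one or two backticks closed by a run of the same length, runs of three or more backticks and unmatched delimiters are copied through as literal text, and a backslash followed by a backtick is copied through unchanged. It builds the output with a StringBuilder instead of calling Replace on the input while iterating over it.

```csharp
using System;
using System.Text;

// Hypothetical sketch; every name here is an assumption for illustration.
public static class InlineCodeSketch
{
    // Escapes the same three characters the original code escapes.
    static string EscapeHtml(string s)
    {
        return s.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
    }

    // Counts consecutive backticks starting at 'start'.
    static int CountBackticks(string input, int start)
    {
        int n = 0;
        while (start + n < input.Length && input[start + n] == '`') n++;
        return n;
    }

    // Returns the index of the next run of exactly 'length' backticks
    // at or after 'from', or -1 if there is none.
    static int FindClosingRun(string input, int from, int length)
    {
        int i = from;
        while (i < input.Length)
        {
            if (input[i] != '`') { i++; continue; }
            int run = CountBackticks(input, i);
            if (run == length) return i;
            i += run; // skip a run of the wrong length
        }
        return -1;
    }

    public static string ParseInlineCode(string input)
    {
        if (input == null) throw new ArgumentNullException("input");
        var output = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            // Backslash-escaped backtick: copy both characters through unchanged.
            if (c == '\\' && i + 1 < input.Length && input[i + 1] == '`')
            {
                output.Append(c).Append('`');
                i += 2;
                continue;
            }
            if (c != '`') { output.Append(c); i++; continue; }

            int open = CountBackticks(input, i);
            int close = open <= 2 ? FindClosingRun(input, i + open, open) : -1;
            if (close < 0)
            {
                // No matching delimiter: keep the backticks as literal text.
                output.Append('`', open);
                i += open;
                continue;
            }
            string body = input.Substring(i + open, close - (i + open));
            output.Append("<code>").Append(EscapeHtml(body)).Append("</code>");
            i = close + open;
        }
        return output.ToString();
    }
}
```

Because every index is checked against input.Length before it is read, short inputs such as a lone backtick are returned unchanged rather than throwing.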

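I quickly fired up PEX and ran it against the code - it found the IndexOutOfRangeExceptions I had thought of, and some more. And of course PEX found the obvious NullReferenceException when the input is a null reference. Here are the inputs PEX found to cause exceptions.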

case1 = "`"

case2 = "\0`"

case3 = "\0``"

case4 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0````"

case5 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0```\0````````````\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0`"

case6 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0```\0````````````\0\0\0\0\0\0\0\0\0\0``<\0\0`````````````````````````````````````````````````````````````````````````````````````\0\0\0\0\0\0\0\0\0\0``<\0\0```````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````\0\0\0\0\0\0\0\0\0`\0```````````````"

My "fix" of the code changed which inputs cause exceptions (and may have introduced new bugs). PEX caught the following in the modified code.

case7 = "\0```"

case8 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0```\0`\0"

case9 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0```\0````````````\0\0\0\0\0\0\0\0\0\0``<\0\0`````````````````````````````````````````````````````````````````````````````````````\0\0\0\0\0\0\0\0\0\0``\0`\0`\0``"

None of these three inputs caused an exception in the original code, while cases 4 and 6 no longer cause an exception in the modified code.
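
Inputs like these are worth keeping as regression cases. The sketch below shows one framework-free way to replay them against any version of the parser; the class and method names are hypothetical, and the parser is passed in as a delegate so that nothing is assumed about its visibility or location.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical harness (names are assumptions, not part of PEX or the original code).
public static class CrashInputHarness
{
    // Runs 'parser' on each input and returns the inputs that made it throw.
    public static List<string> FindThrowingInputs(
        Func<string, string> parser, IEnumerable<string> inputs)
    {
        var failures = new List<string>();
        foreach (string input in inputs)
        {
            try
            {
                parser(input);
            }
            catch (Exception)
            {
                // Any exception counts as a failure for this input.
                failures.Add(input);
            }
        }
        return failures;
    }
}
```

For example, calling FindThrowingInputs with the parser under test and the short inputs listed above ("`", "\0`", "\0``", "\0```") returns the subset that still throws, so an empty list means none of them crash the parser any more.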

Answer 1: (score: 1)

Here is a small snippet, tested in LinqPad, to get you started

void Main()
{
    string test = "here is some code `public void Method( )` but ``this is not code``";
    Regex r = new Regex( @"(`[^`]+`)" );

    MatchCollection matches = r.Matches( test );

    foreach( Match match in matches )
    {
        Console.Out.WriteLine( match.Value );
        // Guard against a match at index 0 before looking at the preceding character.
        if( match.Index > 0 && test[match.Index - 1] == '`' )
            Console.Out.WriteLine( "NOT CODE" );
        else
            Console.Out.WriteLine( "CODE" );
    }
}

Output:

`public void Method( )`
CODE
`this is not code`
NOT CODE
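
Since the question asks for double-backtick spans to be treated as code, the same regex idea can be extended to handle both delimiters. The following is a sketch under assumptions rather than a complete solution: the class and method names are hypothetical, backslash escapes and runs of three or more backticks are not given any special treatment, and a span cannot cross a line break because `.` does not match a newline by default.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch; names are assumptions for illustration.
public static class RegexInlineCode
{
    // The double-backtick alternative is listed first, so at a position where
    // both could match, ``a ` b`` is taken as one span.
    static readonly Regex InlineCode = new Regex(@"``(.+?)``|`([^`]+)`");

    public static string Convert(string input)
    {
        return InlineCode.Replace(input, m =>
        {
            // Group 1 is the body of a double-backtick span, group 2 of a single one.
            string body = m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
            body = body.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
            return "<code>" + body + "</code>";
        });
    }
}
```

Regex.Replace with a match evaluator rewrites each span where it was found, which avoids the index bookkeeping needed when the input string is modified inside a loop.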