What is the best way to parse this string inside a string?

Time: 2013-12-19 21:28:48

Tags: c# regex string parsing

I have the following string:

 string fullString = "group = '2843360' and (team in ('TEAM1', 'TEAM2','TEAM3'))";

I want to parse this string into:

 string group = ParseoutGroup(fullString);  // Expect "2843360"
 string[] teams = ParseoutTeamNames(fullString); // Expect array with three items

In the full string, I could have one or many teams listed (not always three as above).

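I have this partially working, but my code feels very hacky and not very future-proof, so I wanted to see whether there is a better regex solution or a more elegant way to parse these values out of the full string. Other things may be added to the string later, so I want this to be as foolproof as possible.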

6 Answers:

Answer 0: (score: 6)

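In the simplest cases, a regular expression is probably the best answer. Unfortunately, in this case it looks like we need to parse a subset of the SQL language. While it is possible to solve this with regular expressions, they are not designed for parsing complex languages (nested brackets and escaped strings).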
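
To make the escaped-string concern concrete, here is a minimal sketch. It assumes a backslash escapes a quote inside a value (the question does not say so), and the `QuoteDemo` class and its method names are hypothetical. The simple `[^']+` pattern stops at the first quote it sees, while an escape-aware pattern is noticeably harder to read:

```csharp
using System.Text.RegularExpressions;

public static class QuoteDemo
{
    // Naive pattern: the value ends at the first single quote, escaped or not.
    public static string Naive(string input)
    {
        return Regex.Match(input, @"group = '([^']+)'").Groups[1].Value;
    }

    // Escape-aware pattern (assumes backslash escaping): a backslash followed
    // by any character, or any character that is neither a quote nor a backslash.
    public static string EscapeAware(string input)
    {
        return Regex.Match(input, @"group = '((?:\\.|[^'\\])*)'").Groups[1].Value;
    }
}
```

For the input `group = 'O\'Brien'`, `Naive` captures only `O\`, whereas `EscapeAware` captures `O\'Brien` (still containing the backslash, which would have to be removed in a separate step).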

The requirements are also likely to evolve over time and call for parsing more complex structures.

If company policy allows, I would build an internal DSL to parse this string.

One of my favorite tools for building internal DSLs is called Sprache.

Below you can find an example parser that uses the internal DSL approach.

In the code, I define primitives that handle the required SQL operators and compose the final parser out of them.

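    // Requires the Sprache and NUnit packages:
    // using System.Collections.Generic; using NUnit.Framework; using Sprache;
    [Test]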
    public void Test()
    {
        string fullString = "group = '2843360' and (team in ('TEAM1', 'TEAM2','TEAM3'))";


        var resultParser =
            from @group in OperatorEquals("group")
            from @and in OperatorEnd()
            from @team in Brackets(OperatorIn("team"))
            select new {@group, @team};
        var result = resultParser.Parse(fullString);
        Assert.That(result.group, Is.EqualTo("2843360"));
        Assert.That(result.team, Is.EquivalentTo(new[] {"TEAM1", "TEAM2", "TEAM3"}));
    }

    private static readonly Parser<char> CellSeparator =
        from space1 in Parse.WhiteSpace.Many()
        from s in Parse.Char(',')
        from space2 in Parse.WhiteSpace.Many()
        select s;

    private static readonly Parser<char> QuoteEscape = Parse.Char('\\');

    private static Parser<T> Escaped<T>(Parser<T> following)
    {
        return from escape in QuoteEscape
               from f in following
               select f;
    }

    private static readonly Parser<char> QuotedCellDelimiter = Parse.Char('\'');

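    // Try the escaped delimiter first; otherwise the backslash is consumed as
    // ordinary content and the following quote ends the cell early.
    private static readonly Parser<char> QuotedCellContent =
        Escaped(QuotedCellDelimiter).Or(Parse.AnyChar.Except(QuotedCellDelimiter));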

    private static readonly Parser<string> QuotedCell =
        from open in QuotedCellDelimiter
        from content in QuotedCellContent.Many().Text()
        from end in QuotedCellDelimiter
        select content;

    private static Parser<string> OperatorEquals(string column)
    {
        return
            from c in Parse.String(column)
            from space1 in Parse.WhiteSpace.Many()
            from opEquals in Parse.Char('=')
            from space2 in Parse.WhiteSpace.Many()
            from content in QuotedCell
            select content;
    }

    private static Parser<bool> OperatorEnd()
    {
        return
            from space1 in Parse.WhiteSpace.Many()
            from c in Parse.String("and")
            from space2 in Parse.WhiteSpace.Many()
            select true;
    }

    private static Parser<T> Brackets<T>(Parser<T> contentParser)
    {
        return from open in Parse.Char('(')
               from space1 in Parse.WhiteSpace.Many()
               from content in contentParser
               from space2 in Parse.WhiteSpace.Many()
               from close in Parse.Char(')')
               select content;
    }

    private static Parser<IEnumerable<string>> ComaSeparated()
    {
        return from leading in QuotedCell
               from rest in CellSeparator.Then(_ => QuotedCell).Many()
               select Cons(leading, rest);
    }

    private static Parser<IEnumerable<string>> OperatorIn(string column)
    {
        return
            from c in Parse.String(column)
            from space1 in Parse.WhiteSpace
            from opEquals in Parse.String("in")
            from space2 in Parse.WhiteSpace.Many()
            from content in Brackets(ComaSeparated())
            from space3 in Parse.WhiteSpace.Many()
            select content;
    }

    private static IEnumerable<T> Cons<T>(T head, IEnumerable<T> rest)
    {
        yield return head;
        foreach (T item in rest)
            yield return item;
    }

Answer 1: (score: 4)

I managed to do it using regular expressions:

var str = "group = '2843360' and (team in ('TEAM1', 'TEAM2','TEAM3'))";

// Grabs the group ID
var group = Regex.Match(str, @"group = '(?<ID>\d+)'", RegexOptions.IgnoreCase)
    .Groups["ID"].Value;

// Grabs everything inside teams parentheses
var teams = Regex.Match(str, @"team in \((?<Teams>(\s*'[^']+'\s*,?)+)\)", RegexOptions.IgnoreCase)
    .Groups["Teams"].Value;

// Trim and remove single quotes
var teamsArray = teams.Split(new char[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
    .Select(s =>
        {
            var trimmed = s.Trim();
            return trimmed.Substring(1, trimmed.Length - 2);
        }).ToArray();

The result will be:

string[] { "TEAM1", "TEAM2", "TEAM3" }

Answer 2: (score: 1)

There may be a regex solution, but if the format is strict I try efficient string methods first. The following works with your input.

I'm using a custom class, TeamGroup, to encapsulate the complexity and keep all relevant properties in one object:

public class TeamGroup
{
    public string Group { get; set; }
    public string[] Teams { get; set; }

    public static TeamGroup ParseOut(string fullString)
    {
        TeamGroup tg = new TeamGroup{ Teams = new string[]{ } };
        int index = fullString.IndexOf("group = '");
        if (index >= 0)
        {
            index += "group = '".Length;
            int endIndex = fullString.IndexOf("'", index);
            if (endIndex >= 0)
            {
                tg.Group = fullString.Substring(index, endIndex - index).Trim(' ', '\'');
                endIndex += 1;
                index = fullString.IndexOf(" and (team in (", endIndex);
                if (index >= 0)
                {
                    index += " and (team in (".Length;
                    endIndex = fullString.IndexOf(")", index);
                    if (endIndex >= 0)
                    {
                        string allTeamsString = fullString.Substring(index, endIndex - index);
                        tg.Teams = allTeamsString.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
                            .Select(t => t.Trim(' ', '\''))
                            .ToArray();
                    }
                }
            }
        }
        return tg;
    }
}

You would use it this way:

string fullString = "group = '2843360' and (team in ('TEAM1', 'TEAM2','TEAM3'))";
TeamGroup tg = TeamGroup.ParseOut(fullString);
Console.Write("Group: {0} Teams: {1}", tg.Group, string.Join(", ", tg.Teams));

Output:

Group: 2843360 Teams: TEAM1, TEAM2, TEAM3

Answer 3: (score: 0)

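I think you need to look into a tokenization process to get the desired result, one that takes into account the order of execution established by the parentheses. You could use the shunting-yard algorithm to help with tokenization and execution order.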

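The advantage of shunting-yard is that it lets you define tokens that can later be used to parse the string properly and perform the correct operation. Although it is normally applied to mathematical order of operations, it can be adapted to your purpose.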

Here is some information:

http://en.wikipedia.org/wiki/Shunting-yard_algorithm
http://www.slideshare.net/grahamwell/shunting-yard
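
As a rough illustration of the tokenization step only (this is not the shunting-yard algorithm itself), the sketch below splits the filter string into quoted strings, punctuation, and words. The `FilterTokenizer` class and the choice of token kinds are assumptions made for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FilterTokenizer
{
    // \G anchors each match at the current position, so unknown characters
    // cause a failure instead of being skipped silently.
    private static readonly Regex TokenPattern =
        new Regex(@"\G\s*(?<tok>'[^']*'|[(),=]|\w+)");

    public static List<string> Tokenize(string input)
    {
        input = input.TrimEnd();
        var tokens = new List<string>();
        int pos = 0;
        while (pos < input.Length)
        {
            Match m = TokenPattern.Match(input, pos);
            if (!m.Success)
                throw new FormatException("Unexpected character at position " + pos);
            tokens.Add(m.Groups["tok"].Value);
            pos = m.Index + m.Length; // continue right after this token
        }
        return tokens;
    }
}
```

For the string in the question this yields 15 tokens, starting with `group`, `=`, `'2843360'`, `and`, `(`. A shunting-yard pass, or a small recursive-descent parser, could then consume this list.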

Answer 4: (score: 0)

If fullString is not machine-generated, you may need to add some error handling, but this will work out of the box and gives you a test to work with.

    public string ParseoutGroup(string fullString)
    {
        var matches = Regex.Matches(fullString, @"group\s?=\s?'([^']+)'", RegexOptions.IgnoreCase);
        return matches[0].Groups[1].Captures[0].Value;
    }

    public string[] ParseoutTeamNames(string fullString)
    {
        var teams = new List<string>();
        var matches = Regex.Matches(fullString, @"team\s?in\s?\((\s*'([^']+)',?\s*)+\)", RegexOptions.IgnoreCase);
        foreach (var capture in matches[0].Groups[2].Captures)
        {
            teams.Add(capture.ToString());
        }
        return teams.ToArray();
    }

    [Test]
    public void parser()
    {
        string test = "group = '2843360' and (team in ('team1', 'team2', 'team3'))";
        var group = ParseoutGroup(test);
        Assert.AreEqual("2843360",group);

        var teams = ParseoutTeamNames(test);
        Assert.AreEqual(3, teams.Count());
        Assert.AreEqual("team1", teams[0]);
        Assert.AreEqual("team2", teams[1]);
        Assert.AreEqual("team3", teams[2]);
    }
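
One possible shape for the error handling mentioned above is a Try-style method that reports failure instead of throwing when the pattern is not found. This is only a sketch: the `SafeParser` class and `TryParseGroup` name are hypothetical, and the pattern is loosened slightly to `\s*`:

```csharp
using System.Text.RegularExpressions;

public static class SafeParser
{
    // Returns false (and a null group) when the input has no group clause.
    public static bool TryParseGroup(string fullString, out string group)
    {
        Match m = Regex.Match(fullString ?? string.Empty,
            @"group\s*=\s*'([^']+)'", RegexOptions.IgnoreCase);
        group = m.Success ? m.Groups[1].Value : null;
        return m.Success;
    }
}
```

Callers can then branch on the return value rather than catching the ArgumentOutOfRangeException that `matches[0]` throws on an empty match collection.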

Answer 5: (score: 0)

An addition to @BrunoLM's solution:

(Worth the extra lines if you will have more variables to check later on):

You can split the string on the "and" keyword and have a function check each clause against the corresponding regex and return the desired value.

(Untested code, but it should convey the idea.)

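// Split on " and " (with spaces) so that values containing "and" are not cut.
string[] statements = statement.Split(new[] { " and " }, StringSplitOptions.RemoveEmptyEntries);
// So now:
// statements[0] = "group = '2843360'"
// statements[1] = "(team in ('TEAM1', 'TEAM2','TEAM3'))"
foreach (string s in statements)
{
    // The two RegexFunctionToExtract_* helpers are placeholders to be implemented.
    if (s.Contains("group")) group = RegexFunctionToExtract_GroupValue(s);
    if (s.Contains("team")) teams = RegexFunctionToExtract_TeamValue(s);
}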

I believe this approach gives cleaner, easier-to-maintain code and a slight optimization.

Of course, this approach does not expect "OR" clauses. However, that can be handled with a little tweaking.
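
A self-contained sketch of the split-then-dispatch idea might look like the following. The `ClauseParser` class and the two patterns are assumptions made for illustration, and a quoted value that itself contains " and " would still be split incorrectly:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class ClauseParser
{
    public static void Parse(string statement, out string group, out string[] teams)
    {
        group = null;
        teams = new string[0];
        // Split on the keyword surrounded by whitespace, ignoring case.
        foreach (string clause in Regex.Split(statement, @"\s+and\s+", RegexOptions.IgnoreCase))
        {
            Match g = Regex.Match(clause, @"\bgroup\s*=\s*'([^']*)'", RegexOptions.IgnoreCase);
            if (g.Success)
            {
                group = g.Groups[1].Value;
                continue;
            }

            Match t = Regex.Match(clause, @"\bteam\s+in\s*\(([^)]*)\)", RegexOptions.IgnoreCase);
            if (t.Success)
            {
                // Pull every quoted item out of the parenthesized list.
                teams = Regex.Matches(t.Groups[1].Value, @"'([^']*)'")
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
            }
        }
    }
}
```

Supporting another clause then means adding one more pattern and branch inside the loop, without touching the existing ones.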