Using Regex to parse and convert a hierarchical text file

Time: 2014-08-22 00:36:38

Tags: c# regex hierarchical

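I have used Regex to convert a file containing a hierarchy into a specified format, but I feel there should be a better way, because I have to determine the parent node manually. Regex seemed like the natural choice because the file has some complexities (which I removed from this example) that Regex handles well. I could be persuaded otherwise, though.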

Here is the problem. The hierarchy is expressed by space indentation, one space per level. Example:

TopLevel
 Next Level
  Leaf 1:24
  Leaf 2:62
 Another 2nd Level
  Leaf 3:1
  Leaf 4:4788
Top Level 2
 Lower Level
  Leaf 5:28298
 Last Level 2
  Leaf 6:9871
  Leaf 7:3

This needs to be converted, effectively, into a dictionary. This is the result of the program below.

TopLevel.Next Level.Leaf 1=24
TopLevel.Next Level.Leaf 2=62
TopLevel.Another 2nd Level.Leaf 3=1
TopLevel.Another 2nd Level.Leaf 4=4788
Top Level 2.Lower Level.Leaf 5=28298
Top Level 2.Last Level 2.Leaf 6=9871
Top Level 2.Last Level 2.Leaf 7=3

My solution is below. The fact that I have to search the capture groups to find the parent node feels wrong.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace ConsoleApplicationTestHierarchyTextToDictionary
{
    class Program
    {
        private const string TestFileContents =
@"TopLevel
 Next Level
  Leaf 1:24
  Leaf 2:62
 Another 2nd Level
  Leaf 3:1
  Leaf 4:4788
Top Level 2
 Lower Level
  Leaf 5:28298
 Last Level 2
  Leaf 6:9871
  Leaf 7:3
";

        private const string ContentLevel1 = "(?<Level1Group>ContentLevel1Header(ContentLevel2)+)+";
        private const string ContentLevel2 = "(?<Level2Group>ContentLevel2Header(ContentDetail)+)";
        private const string ContentLevel1Header = "^(?<Level1HeaderName>IdentifierName)\\s*$\\n";
        private const string ContentLevel2Header = "^\\s(?<Level2HeaderName>IdentifierName)\\s*$\\n";
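        // Require at least one digit so Int32.Parse never sees an empty value.
        private const string ContentDetail = "^\\s{2}(?<DetailName>IdentifierName)\\s*:\\s*(?<DetailValue>\\d+)\\s*$\\n";
        // Names may contain spaces and tabs but must not span lines, so avoid \s here (it also matches newlines).
        private const string IdentifierName = "(\\w([ \\t\\w]*\\w)?)";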

        // Expand the placeholder names in the templates above into one full pattern.
        private static readonly string Expression =
            ContentLevel1
            .Replace("ContentLevel1Header", ContentLevel1Header)
            .Replace("(ContentLevel2)", ContentLevel2)
            .Replace("ContentLevel2Header", ContentLevel2Header)
            .Replace("ContentDetail", ContentDetail)
            .Replace("IdentifierName", IdentifierName);

        private static readonly Regex regex = new Regex(Expression, RegexOptions.Compiled | RegexOptions.Multiline);

        static void Main(string[] args)
        {
            var result = new Dictionary<string, int>();
            Match match = regex.Match(TestFileContents);
            for (int i = 0; i < match.Groups["DetailName"].Captures.Count; i++)
            {
                Capture detailNameCapture = match.Groups["DetailName"].Captures[i];
                string detailName = detailNameCapture.Value;
                string detailValue = match.Groups["DetailValue"].Captures[i].Value;

                // This feels wrong
                Capture level2Group = match.Groups["Level2Group"].Captures.Cast<Capture>().FirstOrDefault(c => c.Contains(detailNameCapture));
                Capture level2Header = match.Groups["Level2HeaderName"].Captures.Cast<Capture>().FirstOrDefault(c => level2Group.Contains(c));
                Capture level1Group = match.Groups["Level1Group"].Captures.Cast<Capture>().FirstOrDefault(c => c.Contains(detailNameCapture));
                Capture level1Header = match.Groups["Level1HeaderName"].Captures.Cast<Capture>().FirstOrDefault(c => level1Group.Contains(c));

                string keyName = String.Format("{0}.{1}.{2}", level1Header, level2Header, detailName);
                result[keyName] = Int32.Parse(detailValue);
            }

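            // Print each key=value pair so the result can be inspected.
            foreach (KeyValuePair<string, int> pair in result)
            {
                Console.WriteLine("{0}={1}", pair.Key, pair.Value);
            }

            Console.ReadKey();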
        }
    }

    static class CaptureHelper
    {
        public static bool Contains(this Capture source, Capture test)
        {
            return source.Index <= test.Index && (source.Index + source.Length) >= (test.Index + test.Length);
        }
    }
}

Is there a cleaner way to achieve this?
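
One direction that might avoid the parent search, offered only as a sketch: read the text line by line and keep a list of the current header names indexed by indentation depth, so each leaf's parents are already known when it is reached. This is not the single-regex approach used above. The names `IndentedTextParser`, `Parse` and `LinePattern` are hypothetical, and the sketch assumes exactly one space per level and integer leaf values; the complexities removed from the example are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: tracks the current parents by indentation depth.
static class IndentedTextParser
{
    // Leading spaces, a name, and an optional ":<digits>" that marks a leaf.
    private static readonly Regex LinePattern = new Regex(
        @"^(?<Indent> *)(?<Name>[^:\s][^:]*?)\s*(?::\s*(?<Value>\d+)\s*)?$",
        RegexOptions.Compiled);

    public static Dictionary<string, int> Parse(string text)
    {
        var result = new Dictionary<string, int>();
        var parents = new List<string>(); // parents[d] is the open header at depth d

        foreach (string rawLine in text.Split('\n'))
        {
            string line = rawLine.TrimEnd('\r');
            if (line.Trim().Length == 0) continue; // skip blank lines

            Match m = LinePattern.Match(line);
            if (!m.Success) throw new FormatException("Unrecognised line: " + line);

            // Assumption: one space of indentation per level.
            int depth = m.Groups["Indent"].Length;
            if (depth > parents.Count)
                throw new FormatException("Indentation skips a level: " + line);

            // Close any headers at this depth or deeper.
            parents.RemoveRange(depth, parents.Count - depth);

            string name = m.Groups["Name"].Value;
            if (m.Groups["Value"].Success)
            {
                // Leaf: join the open headers and the leaf name into the key.
                string key = string.Join(".", parents.Concat(new[] { name }));
                result[key] = int.Parse(m.Groups["Value"].Value);
            }
            else
            {
                // Header: becomes the parent for deeper lines that follow.
                parents.Add(name);
            }
        }
        return result;
    }
}

class SketchProgram
{
    static void Main()
    {
        const string sample =
            "TopLevel\n Next Level\n  Leaf 1:24\n  Leaf 2:62\n" +
            "Top Level 2\n Lower Level\n  Leaf 5:28298\n";

        foreach (KeyValuePair<string, int> pair in IndentedTextParser.Parse(sample))
        {
            Console.WriteLine("{0}={1}", pair.Key, pair.Value);
        }
    }
}
```

Because the parents are kept per depth, the same loop would work for more than three levels, whereas the single-pattern version needs one named group per level. A regex is still used per line, so any per-line complexity could stay in `LinePattern`.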

0 answers:

No answers