C#: lightweight skeletal parsing of a file

Time: 2019-03-03 21:08:51

Tags: c# regex file parsing

I have a file with a format similar to the following:

{1:[...]}{2:[X:11][Y:78][]...}{3:[...]}{4:[...]}{5:
[]
[]
...
[]}$
{1:[...]}{2:[X:43][Y:13][]...}{3:[...]}{4:[...]}{5:
[]
[]
...
[]}$
...

The ellipses stand for many repeated structures or many repeated lines.

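So the file consists of segments that all share the same format and are separated by a delimiter character (shown as `$` in the sample above).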

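What is the best way to extract only the X value of each segment? By "best" I mean optimal in space and time, which probably means avoiding loading the whole file into memory. We could perhaps read each line and use a regular expression to match {2:[X:nn][ and extract nn, but that is only a small part of the line.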

But maybe there is a better way?

1 answer:

Answer 0: (score: 1)

There are many ways to solve this.

Given:

var lines = File.ReadLines(@"D:\Test.txt");

Note: File.ReadLines returns an IEnumerable<string>, so each line is loaded lazily rather than the whole file being read up front.
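
To make the lazy behaviour concrete, here is a minimal sketch. The class name `LazyReadDemo` and the helper `FirstLines` are illustrative and not part of the original answer; the temporary file exists only so the example is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LazyReadDemo
{
    // Hypothetical helper: returns at most `count` lines from the start of the file.
    // File.ReadLines enumerates lazily, so enumeration stops once Take is satisfied.
    public static List<string> FirstLines(string path, int count)
    {
        return File.ReadLines(path).Take(count).ToList();
    }

    public static void Main()
    {
        // Write a small temporary file so the sketch can run on its own.
        var path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "a", "b", "c", "d" });
        try
        {
            foreach (var line in FirstLines(path, 2))
                Console.WriteLine(line);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

By contrast, File.ReadAllLines returns a string[] and therefore holds every line in memory at once.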


Option 1: A regular expression with a positive lookbehind, using the pattern (?<=2:\[X:)\d+

foreach (var line in lines)
{
   var match = Regex.Match(line,@"(?<=2:\[X:)\d+");
   if(match.Success)
      Console.WriteLine(match.Value);  
}
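
The lookbehind is not the only way to isolate the digits; a capture group works as well. The following is a sketch under assumptions: the names `XValueRegex` and `TryGetX` are made up for illustration, and the sample line is modelled on the format shown in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XValueRegex
{
    // Same prefix as Option 1, but the digits are captured in group 1
    // instead of being isolated with a lookbehind.
    private static readonly Regex Pattern =
        new Regex(@"2:\[X:(\d+)", RegexOptions.Compiled);

    // Hypothetical helper: true and the parsed X value if the line contains one.
    public static bool TryGetX(string line, out int x)
    {
        x = 0;
        var match = Pattern.Match(line);
        return match.Success && int.TryParse(match.Groups[1].Value, out x);
    }

    public static void Main()
    {
        // Sample line modelled on the format in the question.
        var line = "{1:[...]}{2:[X:11][Y:78][]...}{3:[...]}{4:[...]}{5:";
        if (TryGetX(line, out var x))
            Console.WriteLine(x);
    }
}
```

Keeping the Regex in a static field means it is constructed once rather than once per line.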

Option 2: A simple string.Split

foreach (var line in lines)
{
   var results = line.Split(new[] { "2:[X:", "][Y:" }, StringSplitOptions.RemoveEmptyEntries);

   if(results.Length>1)
      Console.WriteLine(results[1]);
}

Option 3: "Maybe" use pointers, with fixed and unsafe

public static unsafe (bool found, int value) ParseLine(string line)
{
   const string prefix = "2:[X:"; 
   fixed (char* pLine = line,pPrefix = prefix)
   {

      var pLen = line.Length + pLine;
      var found = false;
      var result = 0;
      var i = 0;
      for (char* p = pLine ,pP = pPrefix; p < pLen; p++)
      {
         if (!found )
         {
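            // Advance on a match; on a mismatch restart the prefix match
            // (sufficient here because '2' occurs only at the start of the prefix).
            if (*p == *(pP + i)) i++;
            else i = (*p == *pP) ? 1 : 0;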
            if( i ==prefix.Length) found = true;
            continue;
         }

         if (*p < '0' || *p > '9')
            break;

         result = result * 10 + *p - '0';


      }

      return (found, result);
   }
}

...

var results = File.ReadLines(@"D:\Test.txt")
                  .Select(ParseLine)
                  .Where(result => result.found)
                  .Select(result => result.value);

foreach (var result in results)
   Console.WriteLine(result);
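
If unsafe code is not an option, the same scan can be written with string.IndexOf and an indexer loop. This is a swapped-in alternative, not the answer's method; the names `SafeScan` and `ParseLineSafe` are hypothetical, and no timing claim is made for it.

```csharp
using System;

public static class SafeScan
{
    // Hypothetical safe counterpart of ParseLine: finds the prefix with an
    // ordinal IndexOf, then accumulates the decimal digits that follow it.
    public static (bool found, int value) ParseLineSafe(string line)
    {
        const string prefix = "2:[X:";
        var start = line.IndexOf(prefix, StringComparison.Ordinal);
        if (start < 0)
            return (false, 0);

        var result = 0;
        for (var i = start + prefix.Length; i < line.Length; i++)
        {
            var c = line[i];
            if (c < '0' || c > '9')
                break; // stop at the first non-digit, e.g. ']'
            result = result * 10 + (c - '0');
        }

        return (true, result);
    }

    public static void Main()
    {
        // Sample line modelled on the format in the question.
        var parsed = ParseLineSafe("{1:[...]}{2:[X:43][Y:13][]...}{3:[...]}");
        if (parsed.found)
            Console.WriteLine(parsed.value);
    }
}
```

It returns the same (found, value) tuple shape, so it can be dropped into the Select/Where pipeline above in place of ParseLine.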

Note: This is not about bashing regex; these are simply different approaches.

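I have not benchmarked this yet, but I suspect pointers will be the fastest, Split a close second, and Regex probably the slowest (even when compiled). Regex is, however, the most readable, maintainable, and robust approach, which is why I listed it first.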

Benchmarks

+----------+------------+-----------+-----------+
|  Method  |    Mean    |   Error   |  StdDev   |
+----------+------------+-----------+-----------+
| RegEx    | 3,358.3 us | 65.169 us | 66.923 us |
| Split    | 1,980.9 us | 38.440 us | 48.614 us |
| Pointers | 287.4 us   | 4.396 us  | 4.112 us  |
+----------+------------+-----------+-----------+

Test code

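using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;

public class Test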
{
   private Regex _regex;

   private string[] data;

   [GlobalSetup]
   public void Setup()
   {
      _regex = new Regex(@"(?<=2:\[X:)\d+", RegexOptions.Compiled);

      data = File.ReadLines(@"D:\Test3.txt")
                 .ToArray();
   }

   [Benchmark]
   public List<int> RegEx()
   {
      return data.Select(line => _regex.Match(line))
                 .Where(x => x.Success)
                 .Select(match => int.Parse(match.Value))
                 .ToList();
   }

   [Benchmark]
   public List<int> Split()
   {
      return data.Select(line => line.Split(new[] { "2:[X:", "][Y:" }, StringSplitOptions.RemoveEmptyEntries))
                 .Where(results => results.Length > 1)
                 .Select(results => int.Parse(results[1]))
                 .ToList();
   }

   [Benchmark]
   public List<int> Pointers()
   {
      return data.Select(ParseLine)
                 .Where(result => result.found)
                 .Select(result => result.value)
                 .ToList();
   }

   public static unsafe (bool found, int value) ParseLine(string line)
   {
      const string prefix = "2:[X:"; 
      fixed (char* pLine = line,pPrefix = prefix)
      {

         var pLen = line.Length + pLine;
         var found = false;
         var result = 0;
         var i = 0;
         for (char* p = pLine ,pP = pPrefix; p < pLen; p++)
         {
            if (!found )
            {
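               // Advance on a match; on a mismatch restart the prefix match
               // (sufficient here because '2' occurs only at the start of the prefix).
               if (*p == *(pP + i)) i++;
               else i = (*p == *pP) ? 1 : 0;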
               if( i ==prefix.Length) found = true;
               continue;
            }

            if (*p < '0' || *p > '9')
               break;

            result = result * 10 + *p - '0';


         }

         return (found, result);
      }
   }
}