I have a file with a format similar to this:
{1:[...]}{2:[X:11][Y:78][]...}{3:[...]}{4:[...]}{5:
[]
[]
...
[]}$
{1:[...]}{2:[X:43][Y:13][]...}{3:[...]}{4:[...]}{5:
[]
[]
...
[]}$
...
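The ellipses stand for many repeated structures or many repeated lines.
So the file consists of identically formatted segments, each terminated by a delimiter character ($ in the sample above).
What is the best way to extract only the X value of each segment? By "best" I mean best in space and time, which probably means avoiding loading the whole file into memory. We could read the file line by line and use a regular expression to match {2:[X:nn][ and extract nn, although that is only a small part of each line.
But maybe there is a better way?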
Answer 0 (score: 1)
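There are many ways to solve this.
Given
var lines = File.ReadLines(@"D:\Test.txt");
Note: File.ReadLines returns an IEnumerable<string>, so it loads each line lazily rather than reading the whole file into memory.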
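Option 1: use a positive lookbehind with the pattern (?<=2:\[X:)\d+
foreach (var line in lines)
{
    // Match the digits that immediately follow the literal text "2:[X:"
    var match = Regex.Match(line, @"(?<=2:\[X:)\d+");
    if (match.Success)
        Console.WriteLine(match.Value);
}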
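As a rough sketch of how Option 1 could be packaged for reuse, the loop above can be wrapped in an iterator that keeps the streaming behaviour of File.ReadLines. The class name RegexXExtractor and the method name ExtractX are hypothetical, and the sketch assumes every X value fits in an int.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexXExtractor
{
    // Same pattern as Option 1: digits preceded by the literal text "2:[X:".
    private static readonly Regex XPattern =
        new Regex(@"(?<=2:\[X:)\d+", RegexOptions.Compiled);

    // Lazily yields the X value of every line that contains one.
    // Lines without a match (for example the "[]" lines) are skipped.
    public static IEnumerable<int> ExtractX(IEnumerable<string> lines)
    {
        foreach (var line in lines)
        {
            var match = XPattern.Match(line);
            if (match.Success)
                yield return int.Parse(match.Value); // assumes the value fits in an int
        }
    }
}
```

It could be called as RegexXExtractor.ExtractX(File.ReadLines(@"D:\Test.txt")), so only one line is held in memory at a time.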
Option 2: a simple string.Split
foreach (var line in lines)
{
    // Split on the text before and after the X value; the value ends up in results[1]
    var results = line.Split(new[] { "2:[X:", "][Y:" }, StringSplitOptions.RemoveEmptyEntries);
    if (results.Length > 1)
        Console.WriteLine(results[1]);
}
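To make the Split approach a little more defensive, one possible variation is to parse the second piece with int.TryParse so that lines which split into several parts but do not carry a numeric X are rejected. This is only a sketch: SplitXExtractor and TryGetX are hypothetical names, and it still assumes that "2:[X:" is followed directly by the digits and then "][Y:", as in the sample.

```csharp
using System;

public static class SplitXExtractor
{
    // The text immediately before and immediately after the X value.
    private static readonly string[] Separators = { "2:[X:", "][Y:" };

    // Returns true and sets x when the line splits into at least two parts
    // and the second part is a valid integer; otherwise returns false.
    public static bool TryGetX(string line, out int x)
    {
        x = 0;
        var parts = line.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
        // For a well-formed segment line, parts[0] is the text before "2:[X:"
        // and parts[1] holds the X digits.
        return parts.Length > 1 && int.TryParse(parts[1], out x);
    }
}
```

Note that Split allocates a new array and new strings for every line, which is consistent with it being slower than a direct character scan.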
Option 3: "possibly" use pointers, with fixed and unsafe
public static unsafe (bool found, int value) ParseLine(string line)
{
    const string prefix = "2:[X:";
    fixed (char* pLine = line, pPrefix = prefix)
    {
        char* pEnd = pLine + line.Length;
        var found = false;
        var result = 0;
        var i = 0; // number of prefix characters matched so far
        for (char* p = pLine; p < pEnd; p++)
        {
            if (!found)
            {
                if (*p == pPrefix[i])
                    i++;
                else
                    i = *p == pPrefix[0] ? 1 : 0; // mismatch: restart the prefix match
                if (i == prefix.Length)
                    found = true;
                continue;
            }
            if (*p < '0' || *p > '9')
                break; // the first non-digit ends the number
            result = result * 10 + (*p - '0');
        }
        return (found, result);
    }
}
...
var results = File.ReadLines(@"D:\Test.txt")
    .Select(ParseLine)
    .Where(result => result.found)
    .Select(result => result.value);
foreach (var result in results)
    Console.WriteLine(result);
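If compiling with unsafe code is not an option, a similar single-pass scan can be sketched with string.IndexOf and ordinary indexing. This is a different technique from the pointer version, it was not part of the benchmark, and its speed relative to the other options is untested. The class name SafeXParser is hypothetical.

```csharp
using System;

public static class SafeXParser
{
    private const string Prefix = "2:[X:";

    // Finds the first "2:[X:" in the line and parses the digits that follow it.
    // Returns (false, 0) when the prefix is absent. Like the pointer version,
    // it returns (true, 0) when the prefix is present but no digits follow.
    public static (bool found, int value) ParseLine(string line)
    {
        int start = line.IndexOf(Prefix, StringComparison.Ordinal);
        if (start < 0)
            return (false, 0);

        int value = 0;
        for (int i = start + Prefix.Length; i < line.Length; i++)
        {
            char c = line[i];
            if (c < '0' || c > '9')
                break; // the first non-digit ends the number
            value = value * 10 + (c - '0');
        }
        return (true, value);
    }
}
```

It has the same signature as ParseLine above, so it could be dropped into the same Select/Where pipeline.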
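Note: this is not about regex bashing; these are just different approaches.
Before benchmarking, I suspected that Pointers would be the fastest, with Split close behind and Regex probably the slowest (even with RegexOptions.Compiled). Regex is still the most readable, maintainable, and reliable approach, which is why I put it first.
Benchmark results (us = microseconds):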
+----------+------------+-----------+-----------+
| Method | Mean | Error | StdDev |
+----------+------------+-----------+-----------+
| RegEx | 3,358.3 us | 65.169 us | 66.923 us |
| Split | 1,980.9 us | 38.440 us | 48.614 us |
| Pointers | 287.4 us | 4.396 us | 4.112 us |
+----------+------------+-----------+-----------+
Test code
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;

public class Test
{
    private Regex _regex;
    private string[] data;

    [GlobalSetup]
    public void Setup()
    {
        _regex = new Regex(@"(?<=2:\[X:)\d+", RegexOptions.Compiled);
        // Load the file once so that the benchmarks measure parsing only, not I/O
        data = File.ReadLines(@"D:\Test3.txt")
            .ToArray();
    }

    [Benchmark]
    public List<int> RegEx()
    {
        return data.Select(line => _regex.Match(line))
            .Where(x => x.Success)
            .Select(match => int.Parse(match.Value))
            .ToList();
    }

    [Benchmark]
    public List<int> Split()
    {
        return data.Select(line => line.Split(new[] { "2:[X:", "][Y:" }, StringSplitOptions.RemoveEmptyEntries))
            .Where(results => results.Length > 1)
            .Select(results => int.Parse(results[1]))
            .ToList();
    }

    [Benchmark]
    public List<int> Pointers()
    {
        return data.Select(ParseLine)
            .Where(result => result.found)
            .Select(result => result.value)
            .ToList();
    }

    public static unsafe (bool found, int value) ParseLine(string line)
    {
        const string prefix = "2:[X:";
        fixed (char* pLine = line, pPrefix = prefix)
        {
            char* pEnd = pLine + line.Length;
            var found = false;
            var result = 0;
            var i = 0; // number of prefix characters matched so far
            for (char* p = pLine; p < pEnd; p++)
            {
                if (!found)
                {
                    if (*p == pPrefix[i])
                        i++;
                    else
                        i = *p == pPrefix[0] ? 1 : 0; // mismatch: restart the prefix match
                    if (i == prefix.Length)
                        found = true;
                    continue;
                }
                if (*p < '0' || *p > '9')
                    break; // the first non-digit ends the number
                result = result * 10 + (*p - '0');
            }
            return (found, result);
        }
    }
}