Question

I have an indented text file in the following format:

cl /FoD:\jnks\complire_flags /c legacy\roxapi\fjord\Module.c

Note: including file:   d:\jnks\e\patchlevel.f

Note: including file:   d:\3_4_2_patched4\release\include\pyconfig.f
Note: including file:    C:\11.0\VC\INCLUDE\io.f

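Using a StreamReader, I can read the file above, which then needs the following processing:

1) Every line that starts with cl and ends with c is a parent file.
2) Every line that starts with Note and ends with f is a child file.
3) If the left indentation (the number of spaces between "file:" and the drive letter) increases, the .f file is the parent of the .f file below it. So pyconfig.f is the parent of io.f.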
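To make the indentation rule concrete, here is a minimal sketch of how the nesting could be derived with a stack. It is not the code I actually run: IncludeNode and IncludeTreeParser are hypothetical names, and the sketch assumes that the depth is simply the number of spaces after "file:" and that the source file is the last space-separated token of the cl line.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type: one parent .c file or one included .f file.
public sealed class IncludeNode
{
    public string Path;
    public int Depth;                     // 0 for the .c file, >= 1 for .f files
    public IncludeNode Parent;
    public List<IncludeNode> Children = new List<IncludeNode>();
}

public static class IncludeTreeParser
{
    private const string NotePrefix = "Note: including file:";

    // Builds one tree per "cl" line from the lines of the log.
    public static List<IncludeNode> Parse(IEnumerable<string> lines)
    {
        var roots = new List<IncludeNode>();
        // Holds the path from the current root down to the most recent node.
        var stack = new Stack<IncludeNode>();

        foreach (string raw in lines)
        {
            string line = raw.TrimEnd();

            if (line.StartsWith("cl ", StringComparison.Ordinal) &&
                line.EndsWith("c", StringComparison.Ordinal))
            {
                // Assumption: the source file is the last token on the line.
                string path = line.Substring(line.LastIndexOf(' ') + 1);
                var root = new IncludeNode { Path = path, Depth = 0 };
                roots.Add(root);
                stack.Clear();
                stack.Push(root);
            }
            else if (stack.Count > 0 &&
                     line.StartsWith(NotePrefix, StringComparison.Ordinal) &&
                     line.EndsWith("f", StringComparison.Ordinal))
            {
                string rest = line.Substring(NotePrefix.Length);
                string path = rest.TrimStart(' ');
                int depth = rest.Length - path.Length;   // spaces after "file:"
                if (depth == 0 || path.Length == 0) continue; // skip malformed lines

                // Pop siblings and deeper nodes; the root (depth 0) always stays.
                while (stack.Peek().Depth >= depth) stack.Pop();

                var node = new IncludeNode { Path = path, Depth = depth, Parent = stack.Peek() };
                stack.Peek().Children.Add(node);
                stack.Push(node);
            }
        }
        return roots;
    }
}
```

With the sample above, patchlevel.f and pyconfig.f would both become children of Module.c (same indentation), and io.f would become a child of pyconfig.f.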

Using Entity Framework, I write the data above into two SQL Server tables: a parent table (for .c files only) and a child table (for .f files only).

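My big problem is that reading the file takes 6 hours (using a StreamReader), and writing it to the database takes another 6 hours (using Entity Framework). I first tried reading the whole file and then writing it. I also tried reading one parent .c file at a time and writing its information together with its child .f files.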

The file size may grow to 5 GB in the future, so I would really appreciate help in achieving better performance.

Here is part of my reading logic:

while (!isEndOfFile)
{
    // Read next Line conditionally
    if (readNextLine)
    {
        if (inputFile.Peek() > -1)
        {
            line = inputFile.ReadLine();
        }
        else
        {
            isEndOfFile = true;
            continue;
        }
    }

    // Get the name of the CPP file - Condition is that it starts with cl
    if (isCPPFile(line))
    {
        // Regular expression match to extract the CPP file name
        Match match = cppFilePathRegex.Match(line);
        if (match.Success)
        {
            cppFileName = match.Value;
            addFileDetails = true;
        }
        readNextLine = true;
    }
    // Check whether the line starts with the header text "Note: including file:" and we already have a parent CPP file
    else if (addFileDetails && isHeaderFile(line))
    {
        // do something
    }
}

Answer 1

1) Go read "why GNU grep is fast?". It gives many hints on how to process input text files quickly, especially when searching for patterns.

2) Use SqlBulkCopy to transfer the data to SQL Server. EF is definitely not a suitable solution for bulk imports.

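However, if I were you, I would run del /q /s on my entire import solution and start from scratch with SQL Server Integration Services. SSIS is a dedicated solution for your task, and it contains countless optimizations for file reading, record access, buffering, cache access, and finally database writes.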
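For point 2, a minimal sketch of what a SqlBulkCopy import could look like. The table name dbo.ChildFile, its columns (ParentId, Path, Depth) and the class name ChildFileBulkWriter are hypothetical and would have to match your actual schema; on newer .NET versions the namespace may be Microsoft.Data.SqlClient instead of System.Data.SqlClient.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient; // assumption: .NET Framework; otherwise Microsoft.Data.SqlClient

public static class ChildFileBulkWriter
{
    // Hypothetical layout: dbo.ChildFile(ParentId INT, Path NVARCHAR, Depth INT).
    public static DataTable ToDataTable(IEnumerable<Tuple<int, string, int>> rows)
    {
        var table = new DataTable("ChildFile");
        table.Columns.Add("ParentId", typeof(int));
        table.Columns.Add("Path", typeof(string));
        table.Columns.Add("Depth", typeof(int));

        foreach (var row in rows)
        {
            table.Rows.Add(row.Item1, row.Item2, row.Item3);
        }
        return table;
    }

    // Sends the buffered rows to SQL Server in one bulk operation.
    public static void Write(string connectionString, DataTable table)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.ChildFile"; // hypothetical table name
            bulk.BatchSize = 10000;                      // rows per batch sent to the server
            bulk.BulkCopyTimeout = 0;                    // 0 = no time limit

            // Map columns by name so the order in the DataTable does not matter.
            foreach (DataColumn column in table.Columns)
            {
                bulk.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            }
            bulk.WriteToServer(table);
        }
    }
}
```

For a multi-gigabyte input it would probably make sense to fill the DataTable with a bounded number of rows, call Write, clear the table, and repeat, rather than building one huge table in memory.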

Answer 2

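If I split the file before processing it, the time seems to drop considerably (almost 2 hours). The file has a tree structure, so it has to be processed line by line, but I can split it wherever a certain character occurs that marks the start of a new tree.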

If, instead of splitting, I read the file in blocks that start at that character, the 2 GB file still takes up a lot of memory.
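One way block-wise reading could avoid holding the whole file in memory is a lazy iterator that yields one tree at a time. This is only a sketch and not something measured here: TreeBlockReader is a hypothetical name, and it assumes a new tree starts at a line that begins with "cl " and contains "/Fo".

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TreeBlockReader
{
    // Groups lines into blocks; each block is one "cl" line plus the lines that follow it.
    // Only the block currently being built is kept in memory.
    public static IEnumerable<List<string>> ReadBlocks(IEnumerable<string> lines)
    {
        List<string> block = null;

        foreach (string line in lines)
        {
            bool startsNewTree =
                line.StartsWith("cl ", StringComparison.Ordinal) && line.Contains("/Fo");

            if (startsNewTree)
            {
                if (block != null) yield return block; // hand over the finished tree
                block = new List<string>();
            }

            // Lines before the first "cl" line and empty lines are skipped.
            if (block != null && line.Length > 0) block.Add(line);
        }

        if (block != null) yield return block;         // last tree in the file
    }

    // File.ReadLines enumerates lazily, so the file is not loaded as a whole.
    public static IEnumerable<List<string>> ReadBlocksFromFile(string path)
    {
        return ReadBlocks(File.ReadLines(path));
    }
}
```

Each block could then be processed and written to the database before the next one is read, for example with foreach (var block in TreeBlockReader.ReadBlocksFromFile(inputPath)) { ... }, where inputPath is whatever file you are parsing.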

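I split the file with the PowerShell script below; later I will work out how to call PowerShell together with my C# application (which does the processing and the database inserts). I am still working on reducing the time, but please find my PowerShell script below for reference.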

# My PowerShell script

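$Path = "D:\Parser\Test\"            # path of the input file
$PathSplit = "D:\Parser\Test\Cpp\"   # path of the output files
$InputFile = Join-Path $Path "input_file.txt"   # input file name
$Reader = New-Object System.IO.StreamReader($InputFile)
$N = 1
$OutputFile = $null
try {
    while (($Line = $Reader.ReadLine()) -ne $null)
    {
        # A line that starts with "cl" and contains "/Fo" begins a new tree
        if (($Line -match "^cl\b") -and ($Line -match "/Fo")) {
            $OutputFile = Join-Path $PathSplit "$N.txt"
            $N++
        }
        # Append the line to the output file of the current tree
        if ($OutputFile) { Add-Content -Path $OutputFile -Value $Line }
    }
}
finally { $Reader.Close() }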

Reading a 2 GB file with C# takes too much time
