Improving the performance of a file delimiter check

Asked: 2016-03-14 23:28:41

Tags: performance powershell powershell-v4.0 powershell-ise

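After spending some time looking for the clearest way to check whether the body of a file has the same number of delimiters as its header, I came up with this code: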

Param #user enters the directory path and delimiter they are checking for
(
    [string]$source,
    [string]$delim
)

#try {
$lineNum = 1
$thisOK = 0
$badLine = 0
$noDelim = 0
$archive = ("*archive*","*Archive*","*ARCHIVE*");

foreach ($files in Get-ChildItem $source -Exclude $archive) #folder directory may have sub folders, as a temp workaround just made sure to exclude any folder with archive
{
    $read2 = New-Object System.IO.StreamReader($files.FullName)
    $DataLine = (Get-Content $files.FullName)[0]
    $validCount = ([char[]]$DataLine -eq $delim).count #count of delimiters in the header
    $lineNum = 1 #used to write to host which line is bad in file
    $thisOK = 0 #used for the if condition to let the host know that the file has delimiters that line up with the header
    $badLine = 0 #used so the write-host doesn't meet the if condition and write that the file is ok after throwing an error

    while (!$read2.EndOfStream)
    {
        $line = $read2.ReadLine()
        $total = $line.Split($delim).Length - 1;

        if ($total -eq $validCount)
        {
            $thisOK = 1
        }
        else #the counts differ, so this line is bad
        {
            Write-Output "Error on line $lineNum for file $files. Line number $lineNum has $total delimiters and the header has $validCount"
            $thisOK = 0
            $badLine = 1
            break; #break or else it will repeat each line that is bad
        }
        $lineNum++
    }
    if ($thisOK -eq 1 -and $badLine -eq 0 -and $validCount -ne 0) #-eq compares; a single = would assign to $thisOK
    {
        Write-Output "$files is ok"
    }
    if ($validCount -eq 0)
    {
        Write-Output "$files does not contain entered delimiter: $delim"
    }
    $read2.Close()
    $read2.Dispose()
} #end foreach loop
#} catch {
#    $ErrorMessage = $_.Exception.Message
#    $FailedItem = $_.Exception.ItemName
#}

It works for what I have tested so far. However, it takes much longer on larger files. What can I do or change in this code to make it process these text/CSV files faster?

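Also, my try..catch statements are commented out because the script did not seem to run when I included them: there was no error, it just returned to a new command prompt. As a further idea, I would like to add a simple GUI so that other users can double-check their files.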

Sample file:

HeaderA|HeaderB|HeaderC|HeaderD          //header line
DataLnA|DataLnBBDataLnC|DataLnD|DataLnE  //bad line
DataLnA|DataLnB|DataLnC|DataLnD|         //bad line
DataLnA|DataLnB|DataLnC|DataLnD          //good line

Now that I look at it, I think there may be a problem when a line has the correct number of delimiters but the columns do not line up:

HeaderA|HeaderB|HeaderC|HeaderD
DataLnA|DataLnBDataLnC|DataLnD|

1 answer:

Answer 0 (score: 0)

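The main problem I see is that you are reading the file twice: once with the Get-Content call, which reads the entire file into memory, and a second time in your while loop. You can speed up the process by replacing this line: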

$DataLine = (Get-Content $files.FullName)[0]    #inefficient

with this:

$DataLine = Get-Content $files.FullName -First 1   #efficient
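
Going one step further, the header can be taken from the same StreamReader that scans the body, so each file is opened and read only once. The following is a minimal sketch of that idea, not the asker's script: the function name Test-DelimiterCount, its parameters, and its return convention (0 = all lines match, -1 = header has no delimiter, otherwise the 1-based number of the first bad line) are assumptions made for illustration. It also wraps the reader in try/finally so the file handle is released even if reading fails.

```powershell
# Hypothetical helper (not part of the original post).
# Returns  0 if every line has as many delimiters as the header,
#         -1 if the file is empty or the header contains no delimiter,
#          otherwise the 1-based number of the first mismatching line.
function Test-DelimiterCount {
    param(
        [Parameter(Mandatory)][string]$Path,      # pass a full path, e.g. $file.FullName
        [Parameter(Mandatory)][char]$Delimiter    # a single character such as '|'
    )

    $reader = New-Object System.IO.StreamReader($Path)
    try {
        # The header is read from the same stream, so the file is opened only once.
        $header = $reader.ReadLine()
        if ($null -eq $header) { return -1 }

        $validCount = $header.Split($Delimiter).Length - 1
        if ($validCount -eq 0) { return -1 }

        $lineNum = 1   # the header counts as line 1
        while ($null -ne ($line = $reader.ReadLine())) {
            $lineNum++
            if (($line.Split($Delimiter).Length - 1) -ne $validCount) {
                return $lineNum   # stop at the first bad line
            }
        }
        return 0
    }
    finally {
        # Runs on every exit path, including the early returns above.
        $reader.Dispose()
    }
}
```

Inside the original foreach loop this could be called as Test-DelimiterCount -Path $files.FullName -Delimiter $delim, with the three messages chosen from the returned number. Whether it is measurably faster than the -First 1 change depends on the file sizes involved, so it is worth timing both with Measure-Command.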