Searching thousands of files in a directory using a CSV list

Date: 2019-08-26 12:39:34

Tags: performance powershell glob

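I am using PowerShell 5.0 and have a .CSV file containing a list of siebelid values (about 5000) to search for. I want to search every folder and subfolder on a server for any file whose name contains one of those list items (siebelid), i.e. file names such as 32444167.pdf or 32444167.pdf.metadata.properties.xml.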

Sample CSV file:

32444167,ACME,4/15/2013
27721071,ACME,4/15/2013
27721072,ACME,4/15/2013

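I am filtering on *.PDF and *.XML. I then want to copy the files that were found to a destination folder on the same server. The problem is that the folders and subfolders contain thousands upon thousands of files, and the code I wrote seems to take days to run. I am no expert and have not written the most efficient PowerShell script. Any help would be appreciated.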

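Basically, the code works, but it runs extremely slowly when processing folders that contain hundreds of thousands of files. Calling Get-ChildItem each time a new item is taken from the list seemed inefficient.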

$PDFExtension = '.pdf'
$XMLExtension = '.pdf.metadata.properties.xml'
$source = 'C:\Temp\CSVtoXML'
$destination = 'C:\Temp\FindFiles\'                                                           #' 
$strGetDate = get-date -UFormat “%Y-%m-%d %H:%M:%S”
$log = $destination + "FileCopyLog.txt"

$FileList = import-csv “C:\Temp\FindFiles\test.csv” -Delimiter "," -Header 'siebelId', 'companyCode', 'receivedDate'
$GetFiles = @(Get-ChildItem -path $source -Recurse -File -include *.xml, *.pdf ) | select -First 100000

ForEach ($item in $FileList){
 $siebelId = $($item.siebelId) + $PDFExtension
 $XMLFile = $($item.siebelId) + $XMLExtension

 $FilterFiles = @($GetFiles) | Where-Object {$_.name -eq $siebelId -or $_.name -eq $XMLFile} #|  Out-File $destination"FileCopyLog.csv"
 #write-host "Filtered Files: " $FilterFiles

 ForEach ($file in $FilterFiles){

   $fileBase = $file.BaseName
   $fileExt = $file.Extension

   write-host "file: " $fileBase$fileExt

   If (-not ([string]::IsNullOrEmpty($file))) {
       if(!(Test-Path -Path $Destination$fileBase$fileExt)) {
            copy-item $file -destination $destination   # Copies files
            write-host "File: [" $file "] has Been Copied! to " $Destination `n`r -ForegroundColor yellow
            $strGetDate = get-date -UFormat “%Y-%m-%d %H:%M:%S”
            $LogValue = $strGetDate + ': ' + "Source: [" + $file + "] Destination: " + $Destination
            Add-Content -Path $log -Value $LogValue
       } else
       {
            write-host "File: [" $file "] already exists in destination folder" `n`r -ForegroundColor yellow
            $strGetDate = get-date -UFormat "%Y-%m-%d %H:%M:%S"
            $LogValue = $strGetDate + ': ' + "File: [" + $file + "] already exists in destination folder! "
            Add-Content -Path $log -Value $LogValue 
       }

   }else{
       write-host "No File was copied!" `n`r -ForegroundColor red
   }
 }
}

write-host 'Script has completed' -ForegroundColor green


The result I am looking for is for this process to complete in a few hours rather than days.

3 answers:

Answer 0: (score: 1)

Make a single pass over the files instead of re-filtering the whole file list inside a loop.

Modified to use '.pdf.metadata.properties.xml' instead of XML, and to match those files by stripping '.pdf.metadata.properties' from the BaseName of the files we find.

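Edit

In addition, the copy time can be reduced by first building a list of the files already in the destination and then filtering the files to copy against it, so that files that already exist are skipped.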



$Exts = @('.pdf', '.pdf.metadata.properties.xml')

$source      = 'C:\Temp\CSVtoXML'
$destination = 'C:\Temp\FindFiles\'
$log         = "$($destination)FileCopyLog.txt"

$SiebelIDFile   = "$($destination)test.csv"
$SiebelIDImport = Import-Csv $SiebelIDFile -Delimiter ',' -Header 'siebelId', 'companyCode', 'receivedDate'
# Extract the ID column once so it is not rebuilt for every file
$SiebelIDs = @($SiebelIDImport.siebelId)

# Enumerate the source tree once per extension; -Filter needs a wildcard in front of the extension
$SRC_Matched_Exts = @($Exts | ForEach-Object { Get-ChildItem -Path $source -Recurse -File -Filter "*$_" })

# Presto, we can filter the list using the Siebel IDs
# (the dots are escaped and the pattern anchored so only the trailing suffix is stripped)
$Results = @($SRC_Matched_Exts | Where-Object { ($_.BaseName -replace '\.pdf\.metadata\.properties$', '') -in $SiebelIDs })

# Confirm results by outputting the first 100
$Results | Select-Object -First 100 | Format-Table -Property BaseName, FullName -AutoSize

# Get destination files to compare
$Dst_Matched_Exts = @($Exts | ForEach-Object { Get-ChildItem -Path $destination -Recurse -File -Filter "*$_" })
$DstNames = @($Dst_Matched_Exts.Name)

# Split the matches into files missing from the destination and files already there
$Src_Files_MissingFromDst = @($Results | Where-Object { $_.Name -notin $DstNames })
$Src_Files_AlreadyInDst   = @($Results | Where-Object { $_.Name -in $DstNames })

# Output some of the files we won't copy because they already exist in dst
Write-Host "
 Output some of the files we won't copy because they already exist in dst:

$($Src_Files_AlreadyInDst | Select-Object -First 100 | Format-Table -Property BaseName, FullName -AutoSize | Out-String)" -ForegroundColor Red

# Output some of the files we will copy
Write-Host "
 Output some of the files we will copy:

$($Src_Files_MissingFromDst | Select-Object -First 100 | Format-Table -Property BaseName, FullName -AutoSize | Out-String)" -ForegroundColor Yellow

$Count = 0
$Total = $Src_Files_MissingFromDst.Count
# Loop over the files and copy them to the destination
$Src_Files_MissingFromDst | ForEach-Object {
  $Count += 1
  Copy-Item -Path $_.FullName -Destination $destination
  # ${Count} is braced because "$Count:" would be parsed as a scope-qualified variable
  Add-Content -Path $log -Value "$(Get-Date -UFormat '%Y-%m-%d %H:%M:%S'): Source File # ${Count}: [$($_.FullName)] Destination: $destination"
  # Update the copy progress every 10 files and on the last file
  if (($Count % 10 -eq 0) -or ($Count -eq $Total)) {
    $Percent = [math]::Round(($Count / $Total) * 100, 1)
    Write-Progress -Activity "======== Copying to $destination" -Status "## $Percent% Complete!" -PercentComplete $Percent
    Write-Host "File # ${Count}: [ $($_.FullName) ] has been copied to $destination" -ForegroundColor Green
  }
}

Now you can script the file copy or move from the collection of matched files. Using parallel processes to speed this up would make sense.

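Looping is generally slower than filtering with a select statement, and using the filter built into a command is generally a better path than filtering its results afterwards, because the filtering then happens at a lower level while the data is being collected.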
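One further refinement, which is not part of the answer above: the -in operator scans the whole ID array for every file, so with about 5000 IDs and hundreds of thousands of files the membership test can become a noticeable cost. A minimal sketch of the same single-pass idea using a .NET HashSet for the lookup is shown below. The temp folder, the sample file names, and the variable names ($root, $ids, $found) are assumptions made up so the example runs on its own.

```powershell
# Hypothetical sandbox: a temp folder with a few sample files and a small CSV
$root   = Join-Path ([System.IO.Path]::GetTempPath()) ("siebel-demo-" + [guid]::NewGuid())
$source = Join-Path $root 'src'
$sub    = Join-Path $source 'sub'
New-Item -ItemType Directory -Path $sub -Force | Out-Null
'32444167.pdf', '32444167.pdf.metadata.properties.xml', '99999999.pdf' |
    ForEach-Object { Set-Content -Path (Join-Path $source $_) -Value 'x' }
Set-Content -Path (Join-Path $sub '27721071.pdf') -Value 'x'
$csv = Join-Path $root 'test.csv'
Set-Content -Path $csv -Value '32444167,ACME,4/15/2013', '27721071,ACME,4/15/2013'

# Load the IDs into a hash set once; Contains() is a hash lookup, not an array scan
$ids = [System.Collections.Generic.HashSet[string]]::new()
Import-Csv $csv -Header siebelId, companyCode, receivedDate |
    ForEach-Object { [void]$ids.Add($_.siebelId) }

# Single pass over the tree: capture the part before ".pdf" and look it up in the set
$found = @(Get-ChildItem -Path $source -Recurse -File |
    Where-Object {
        $_.Name -match '^([^.]+)\.pdf(\.metadata\.properties\.xml)?$' -and
        $ids.Contains($Matches[1])
    })

$found.Name | Sort-Object

# Clean up the sandbox
Remove-Item -Path $root -Recurse -Force
```

Whether this helps in practice depends on where the time goes. If the directory crawl itself dominates, the lookup change alone will not bring days down to hours.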

Answer 1: (score: 0)

Try:

$(Get-ChildItem -path $source -Recurse -File -Filter *.xml
  Get-ChildItem -path $source -Recurse -File -Filter *.pdf)

Answer 2: (score: 0)

The siebelID seems to have 8 digits, which you can use to select the files.

I am not sure which is more efficient:

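  • crawling the tree twice (once for each extension), or
  • crawling it only once with a Where-Object and a regular expression that extracts the number and checks whether it is present in $FileList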
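One way to settle the question is to time both options with Measure-Command. The sketch below does that on a small generated tree. The temp folder, the 200 generated IDs, and the variable names are assumptions for illustration, and the timings on a real share with hundreds of thousands of files may look quite different.

```powershell
# Hypothetical test tree in a temp folder; point $source at the real share to measure it
$source = Join-Path ([System.IO.Path]::GetTempPath()) ("crawl-demo-" + [guid]::NewGuid())
New-Item -ItemType Directory -Path $source -Force | Out-Null
1..200 | ForEach-Object {
    $id = '{0:D8}' -f (10000000 + $_)
    Set-Content -Path (Join-Path $source "$id.pdf") -Value 'x'
    Set-Content -Path (Join-Path $source "$id.pdf.metadata.properties.xml") -Value 'x'
}

# Option 1: crawl the tree once per extension
$twoPass = Measure-Command {
    $a = @('*.pdf', '*.pdf.metadata.properties.xml' |
        ForEach-Object { Get-ChildItem -Path $source -Recurse -File -Filter $_ })
}

# Option 2: crawl once and sort out the names with a regular expression
$onePass = Measure-Command {
    $b = @(Get-ChildItem -Path $source -Recurse -File -Filter '*.pdf*' |
        Where-Object { $_.Name -match '^\d{8}\.pdf(\.metadata\.properties\.xml)?$' })
}

'Two crawls: {0:N0} ms, one crawl: {1:N0} ms' -f $twoPass.TotalMilliseconds, $onePass.TotalMilliseconds

# Clean up the test tree
Remove-Item -Path $source -Recurse -Force
```

Running each option several times and ignoring the first run gives a fairer comparison, since the file system cache tends to favor whichever crawl runs second.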

Output should be reduced to the absolute minimum needed, in order to speed up processing.

The following script also removes the redundancy in creating $LogValue:

## Q:\Test\2019\08\26\SO_57658091.ps1
$source = 'Q:\Test\2019' # 'C:\Temp\CSVtoXML'    # 
$target = 'A:\Test\2019' # 'C:\Temp\FindFiles\'  # 
$log = Join-Path $target  "FileCopyLog.txt"

$RE = '^(?<siebelID>\d{8})\.pdf(\.metadata\.properties\.xml)?'
$FileList = Import-Csv "C:\Temp\FindFiles\test.csv" -Header siebelId,companyCode,receivedDate

Get-ChildItem -path $source -Recurse -File -Filter '*.pdf*' |
  Where-Object {($_.Name -match $RE ) -and
                ($Matches.siebelID -in $FileList.siebelID)} | 
ForEach-Object{
    if(!(Test-Path (Join-Path $target $_.Name))) {
        Copy-Item $_.FullName -Destination $target   # Copies files
        $Copied = 'copied to {0}' -f $target
    } else {
        $Copied = 'present in destination'
    }
    $LogValue = '{0}: File: [{1}] {2}' -f (Get-Date -UFormat "%Y-%m-%d %H:%M:%S"),$_.Name,$Copied
    # $LogValue  # optionally output, but that slows down.
    Add-Content -Path $log -Value $LogValue 
}

write-host 'Script has completed' -ForegroundColor green

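A slightly modified version, run against my test folder of stored SO scripts (whose names happen to contain an 8-digit number as well), produced this FileCopyLog.txt: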

2019-08-26 17:46:03: File: [SO_55464728.ps1] copied to A:\Test\2019
2019-08-26 17:46:03: File: [SO_55569099.ps1] copied to A:\Test\2019
2019-08-26 17:46:03: File: [SO_55575835.cmd] copied to A:\Test\2019
2019-08-26 17:46:03: File: [SO_55575543.ps1] copied to A:\Test\2019
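
In the same spirit of keeping per-file work low, the log lines could be collected in memory and written with a single call, instead of calling Add-Content once per file. This is a minimal sketch, not part of the script above. The file names and the temp log path are made up so that the example runs on its own.

```powershell
# Hypothetical names standing in for the files handled by the copy loop
$names = '32444167.pdf', '27721071.pdf', '27721072.pdf'
$log   = Join-Path ([System.IO.Path]::GetTempPath()) ("FileCopyLog-" + [guid]::NewGuid() + ".txt")

# Collect log lines in a list; nothing touches the disk inside the loop
$lines = [System.Collections.Generic.List[string]]::new()
foreach ($name in $names) {
    $stamp = Get-Date -Format 'yyyy-MM-dd HH:mm:ss'
    $lines.Add(('{0}: File: [{1}] copied' -f $stamp, $name))
}

# One write at the end instead of one Add-Content call per file
Add-Content -Path $log -Value $lines
```

The trade-off is that buffered lines are lost if the script is interrupted before the final write, so flushing every few thousand entries may be a reasonable middle ground for a long run.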