Error: Input array is longer than the number of columns in this table (PowerShell)

Date: 2017-04-07 16:03:40

Tags: sql sql-server powershell csv powershell-ise

I'm trying to load a 160 GB CSV file into SQL Server using a PowerShell script from GitHub, and I'm getting this error:

    Exception calling "Add" with "1" argument(s): "Input array is longer than the number of columns in this table."
At C:\b.ps1:54 char:26
+ [void]$datatable.Rows.Add <<<< ($line.Split($delimiter))
    + CategoryInfo          : NotSpecified: (:) [], MethodInvocationException
    + FullyQualifiedErrorId : DotNetMethodException

So I tested the same code with a small 3-line CSV: all the columns matched, the header is in the first row, and there are no extra delimiters. I'm not sure why I'm getting this error.
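To double-check the big file itself, a minimal sketch like the one below (assuming the same file path and tab delimiter as in the script; this is not part of the original script) streams the file and flags any line whose field count differs from the header:

# Minimal check: report every line whose tab-separated field count
# differs from the header's. Reads one line at a time, so it works on
# files that are far too large for an editor.
$csv = "M:\d\s.txt"
$delimiter = "`t"

$reader = New-Object System.IO.StreamReader($csv)
$expected = $reader.ReadLine().Split($delimiter).Count   # header row defines the column count
$lineNo = 1
while (($line = $reader.ReadLine()) -ne $null) {
    $lineNo++
    $fields = $line.Split($delimiter).Count
    if ($fields -ne $expected) {
        Write-Output "Line ${lineNo}: $fields fields (expected $expected)"
    }
}
$reader.Close()

On a 160 GB file this is slow, but it never holds more than one line in memory.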

The code is below:

<# 8-faster-runspaces.ps1 #>
# Set CSV attributes
$csv = "M:\d\s.txt"
$delimiter = "`t"

# Set connstring
$connstring = "Data Source=.;Integrated Security=true;Initial Catalog=PresentationOptimized;PACKET SIZE=32767;"

# Set batchsize to 2000
$batchsize = 2000

# Create the datatable
$datatable = New-Object System.Data.DataTable

# Add generic columns
$columns = (Get-Content $csv -First 1).Split($delimiter) 
foreach ($column in $columns) { 
[void]$datatable.Columns.Add()
}

# Setup runspace pool and the scriptblock that runs inside each runspace
$pool = [RunspaceFactory]::CreateRunspacePool(1,5)
$pool.ApartmentState = "MTA"
$pool.Open()
$runspaces = @()

# Setup scriptblock. This is the workhorse. Think of it as a function.
$scriptblock = {
   Param (
[string]$connstring,
[object]$dtbatch,
[int]$batchsize
   )

$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connstring,"TableLock")
$bulkcopy.DestinationTableName = "abc"
$bulkcopy.BatchSize = $batchsize
$bulkcopy.WriteToServer($dtbatch)
$bulkcopy.Close()
$dtbatch.Clear()
$bulkcopy.Dispose()
$dtbatch.Dispose()
}

# Start timer
$time = [System.Diagnostics.Stopwatch]::StartNew()

# Open the text file from disk and process.
$reader = New-Object System.IO.StreamReader($csv)

Write-Output "Starting insert.."
while ((($line = $reader.ReadLine()) -ne $null))
{
[void]$datatable.Rows.Add($line.Split($delimiter))

if ($datatable.rows.count % $batchsize -eq 0) 
{
   $runspace = [PowerShell]::Create()
   [void]$runspace.AddScript($scriptblock)
   [void]$runspace.AddArgument($connstring)
   [void]$runspace.AddArgument($datatable) # <-- Send datatable
   [void]$runspace.AddArgument($batchsize)
   $runspace.RunspacePool = $pool
   $runspaces += [PSCustomObject]@{ Pipe = $runspace; Status = $runspace.BeginInvoke() }

   # Overwrite object with a shell of itself
  $datatable = $datatable.Clone() # <-- Create new datatable object
}
}

# Close the file
$reader.Close()

# Wait for runspaces to complete
while ($runspaces.Status.IsCompleted -notcontains $true) {}

# End timer
$secs = $time.Elapsed.TotalSeconds

# Cleanup runspaces 
foreach ($runspace in $runspaces ) { 
[void]$runspace.Pipe.EndInvoke($runspace.Status) # EndInvoke method retrieves the results of the asynchronous call
$runspace.Pipe.Dispose()
}

# Cleanup runspace pool
$pool.Close() 
$pool.Dispose()

# Cleanup SQL Connections
[System.Data.SqlClient.SqlConnection]::ClearAllPools()

# Done! Format output then display
$totalrows = 1000000
$rs = "{0:N0}" -f [int]($totalrows / $secs)
$rm = "{0:N0}" -f [int]($totalrows / $secs * 60)
$mill = "{0:N0}" -f $totalrows

Write-Output "$mill rows imported in $([math]::round($secs,2)) seconds ($rs rows/sec and $rm rows/min)"

1 Answer:

Answer 0 (score: 1)

Working with a 160 GB input file is going to be troublesome. You can't really load it into any kind of editor - or at least, you can't analyze that amount of data without some serious automation.

Based on the comments, it seems the data has some quality issues. To find the problematic data, you can try a binary search. This approach narrows things down quickly. Like so (a rough splitting sketch for step 1 follows the list):

1) Split the file into about two equal chunks.
2) Try and load the first chunk.
3) If successful, move on to the second chunk. If not, see 6).
4) Try and load the second chunk.
5) If successful, both files are valid, but you've got some other data quality issue. Start looking into other causes. If not, see 6).
6) If either load failed, start from the beginning and use the failed file as the input file.
7) Repeat until you narrow down the offending row(s).
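Step 1) is the only part that really needs code. A rough streaming sketch (the part file names are hypothetical; the header is copied into both halves so each chunk can be loaded on its own) could look like this:

# Rough sketch: split the source file into two halves by line count,
# streaming so the 160 GB file never has to fit in memory.
$source = "M:\d\s.txt"
$half1  = "M:\d\s_part1.txt"
$half2  = "M:\d\s_part2.txt"

# First pass: count the data lines.
$total = 0
$reader = New-Object System.IO.StreamReader($source)
[void]$reader.ReadLine()                              # skip the header
while ($reader.ReadLine() -ne $null) { $total++ }
$reader.Close()
$cutoff = [int][math]::Floor($total / 2)

# Second pass: copy the header into both halves, then split the data lines.
$reader  = New-Object System.IO.StreamReader($source)
$writer1 = New-Object System.IO.StreamWriter($half1)
$writer2 = New-Object System.IO.StreamWriter($half2)
$header  = $reader.ReadLine()
$writer1.WriteLine($header)
$writer2.WriteLine($header)

$count = 0
while (($line = $reader.ReadLine()) -ne $null) {
    $count++
    if ($count -le $cutoff) { $writer1.WriteLine($line) } else { $writer2.WriteLine($line) }
}

$reader.Close()
$writer1.Close()
$writer2.Close()

Repeat the same split on whichever half fails to load until the offending row(s) are isolated.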

Another approach is to use an ETL tool such as SSIS. Configure the package to redirect invalid rows into an error log so you can see exactly which data isn't working.