Question

I have a tab-delimited file, for example:

tyuy    wqf fdfd
zx c    vbn 733t 601    asd

That is, the last line is zx c[tab]vbn[tab]733t 601[tab]asd.

I need to strip the data before the first tab from every line of a 2 GB file whose lines are about 100 characters long.

In other words, I want to copy the file line by line, keeping only what follows the first tab:

wqf fdfd
vbn 733t 601    asd

I wrote a script that works on small test files:

 powershell -Command "(gc in.txt) -replace '^[^\t]+\t' , '$1' | Out-File -encoding ASCII  out.txt"

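However, it consumes 10 GB of memory and takes hours to run. Is there a way to make this script faster? A cmd.exe .bat file would also be acceptable. Python and Perl cannot be installed on that machine.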

Answer 1

I would use the -split operator to get the part after the first tab.
Because you are working with a large file, these options may work better for you:

Using [System.IO.File]::ReadLines

foreach ($line in [System.IO.File]::ReadLines("D:\in.txt")) {
    Add-Content -Path 'D:\out.txt' -Value ($line -split '\t', 2 )[-1]
}

Using a StreamReader and a StreamWriter, however, may be faster:

$reader = New-Object System.IO.StreamReader("D:\in.txt")
$writer = New-Object System.IO.StreamWriter("D:\out.txt")
while (($line = $reader.ReadLine()) -ne $null) {
    $writer.WriteLine(($line -split '\t', 2 )[-1])
}
$reader.Dispose()
$writer.Dispose()
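
To make it easier to see what the `($line -split '\t', 2)[-1]` expression in both loops returns, here is a minimal sketch run against the sample lines from the question. The third sample line (one with no tab at all) is a hypothetical extra case that is not in the original data; it is included only to show that such a line passes through unchanged.

```powershell
# Sample lines from the question; "`t" stands in for the tab characters.
$samples = @(
    "tyuy`twqf fdfd",
    "zx c`tvbn`t733t 601`tasd",
    "no tab here"   # hypothetical extra case: a line without any tab
)

# Split each line into at most two pieces at the first tab, then keep the
# last piece. With no tab present there is only one piece, the whole line.
$trimmed = foreach ($line in $samples) {
    ($line -split '\t', 2)[-1]
}

$trimmed
```

The limit of 2 is what stops the split after the first tab, so any later tabs stay in the output.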

Answer 2

Get-Content is inefficient for large files. Using the methods of the .NET System.IO.File class is a better approach.

See this article for a comparison of the different techniques: Reading large text files with Powershell
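
As a rough sketch of that idea applied to this question, the function below streams the input with [System.IO.File]::ReadLines and writes through a StreamWriter. The function name Remove-FirstField and its parameter names are hypothetical, and the use of IndexOf/Substring in place of a regex is a substitution made here for illustration, not something taken from the linked article. It also assumes the default StreamWriter encoding (UTF-8) is acceptable, whereas the script in the question wrote ASCII.

```powershell
function Remove-FirstField {
    # Hypothetical helper. Pass full paths: .NET resolves relative paths
    # against the process working directory, not the PowerShell location.
    param(
        [Parameter(Mandatory = $true)] [string] $InPath,
        [Parameter(Mandatory = $true)] [string] $OutPath
    )

    # Assumption: the default encoding (UTF-8) is fine for the output file.
    $writer = New-Object System.IO.StreamWriter($OutPath)
    try {
        # ReadLines enumerates lazily, so the whole file is never held in memory.
        foreach ($line in [System.IO.File]::ReadLines($InPath)) {
            # IndexOf returns -1 when the line has no tab, in which case
            # Substring(0) keeps the whole line.
            $tab = $line.IndexOf([char]9)
            $writer.WriteLine($line.Substring($tab + 1))
        }
    }
    finally {
        # Flush and close the output even if reading fails part-way.
        $writer.Dispose()
    }
}
```

It could be called as `Remove-FirstField -InPath 'D:\in.txt' -OutPath 'D:\out.txt'`, reusing the paths from the first answer.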
