Question

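I have a 2.5 GB ASCII file with about 3.7 million lines. Some of the lines are very long. The lines contain awkward characters that commands may interpret as escapes or special characters (slashes, backslashes, braces of all kinds, and so on).

I have a series of specific grep commands that pull 16 lines out of the file. I want to remove those 16 lines from the big file.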

The lines in temp come to about 10 MB.

Now I want to invert that selection, so that the lines in the temp file are removed from bigfile.

I tried

grep pat1 bigfile | grep -v pat2 | grep -v pat3 | grep -v pat4 > temp

The result was "grep: memory exhausted".

I have access to a Unix shell and simple Tcl scripting to do this.

Thanks, Gert

Answer 1

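Although holding a few tens of megabytes in memory is trivial for a Tcl program, you don't want to hold all 2.5 GB in memory at once if you can help it. That means we keep the lines to exclude in memory and stream the main data through: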

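# Load the exclusions into a list.
# -nonewline drops the file's final newline, so the list gets no trailing
# empty element (which would also filter out every blank line in bigfile).
set f [open "temp"]
set linesToExclude [split [read -nonewline $f] "\n"]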
close $f

# Stream the main data through...
set fIn [open "bigfile"]
set fOut [open "newbigfile" "w"]
while {[gets $fIn line] >= 0} {
    # Only print the line if it isn't in our exclusions
    if {$line ni $linesToExclude} {  # 'ni' for Not In
        puts $fOut $line
    }
}
close $fOut
close $fIn

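As a rule, I prefer not to work with text lines longer than a few hundred bytes. Beyond that, it starts to feel like working with binary data, even if it is formally text...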
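For readers who would rather stay in the shell, the same load-then-stream idea can be sketched with awk. This is only a sketch under assumptions: the function name `remove_lines` is made up for illustration, the exclusion file is assumed to be non-empty, and your awk must cope with very long lines (some implementations limit line length, so check yours). Lines are compared as exact strings, never as patterns, so slashes, backslashes and braces in the data are not interpreted.

```shell
#!/bin/sh
# remove_lines EXCLUDE_FILE BIG_FILE   (hypothetical helper name)
# Prints every line of BIG_FILE that is not an exact copy of a line in
# EXCLUDE_FILE. Only EXCLUDE_FILE is held in memory; BIG_FILE is streamed.
# Assumption: EXCLUDE_FILE is non-empty (the NR == FNR test relies on it).
remove_lines() {
    awk '
        NR == FNR { skip[$0] = 1; next }   # first file: remember each line
        !($0 in skip)                      # second file: print if not remembered
    ' "$1" "$2"
}

# Example with the file names from the question:
# remove_lines temp bigfile > newbigfile
```

Unlike the list search in the Tcl version, the awk array is a hash lookup, though with only 16 exclusion lines the difference should not matter.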

Answer 2

The name "temp" suggests that you have no real need for that file. In that case you can do the whole thing in Tcl, like this:

set fIn [open "bigfile"]
set fOut [open "newbigfile" "w"]
while {[gets $fIn line] >= 0} {
    # Skip the unwanted lines
    if {[regexp pat1 $line] && \
      ![regexp pat2 $line] && \
      ![regexp pat3 $line] && \
      ![regexp pat4 $line]} continue
    # Print lines that made it through
    puts $fOut $line
}
close $fOut
close $fIn

I don't know how long the conversion would take, or whether that is even a concern.
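The single-pass filter above can also be sketched in the shell with awk, again skipping the temp file. Treat this as an illustration rather than a drop-in replacement: `filter_by_patterns` is a made-up name, `pat1` through `pat4` are the placeholders from the question, and awk uses POSIX extended regular expressions, so patterns written for grep or for Tcl's `regexp` may need adjusting.

```shell
#!/bin/sh
# filter_by_patterns BIG_FILE   (hypothetical helper name)
# Drops every line that matches pat1 but none of pat2, pat3, pat4,
# i.e. the lines the grep pipeline in the question would have selected.
# Every other line is printed unchanged.
filter_by_patterns() {
    awk '!(/pat1/ && !/pat2/ && !/pat3/ && !/pat4/)' "$1"
}

# Example with the file names from the question:
# filter_by_patterns bigfile > newbigfile
```

As with the Tcl version, the file is read once and only the current line is held in memory.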

Remove multiple lines from a large file (Tcl or shell)
