Question

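I have two files (say, A.txt and B.txt), and some content may be common to both. Both files are sorted. I need the difference between A.txt and B.txt, i.e. a file C.txt that contains the contents of A except for the lines common to both.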

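I used the typical search-and-print algorithm: take a line from A.txt, search for it in B.txt, and if it is found, don't print it to C.txt; otherwise print the line to C.txt. However, I am dealing with files that have a huge amount of content, so it throws the error: failed to load too many files. (It works for smaller files, though.)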

Can anyone suggest a more efficient way to get C.txt? Scripting language to be used: Tcl only!

Answer 1

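First, the too many files error indicates that you are not closing a channel, probably in the B.txt scanner. Fixing that is probably your first goal. If you've got Tcl 8.6, try this helper procedure: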

proc scanForLine {searchLine filename} {
    set f [open $filename]
    try {
        while {[gets $f line] >= 0} {
            if {$line eq $searchLine} {
                return true
            }
        }
        return false
    } finally {
        close $f
    }
}
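
As a minimal sketch of how the helper might be driven to produce C.txt: the proc name writeDifference is hypothetical and not part of the original answer, and the helper is repeated only so the snippet runs on its own. It reopens B.txt once per line of A.txt, so it is slow on big inputs, but it never holds more than three channels open at a time.

```tcl
package require Tcl 8.6   ;# try/finally needs 8.6

# Same helper as above, repeated so this sketch is self-contained.
proc scanForLine {searchLine filename} {
    set f [open $filename]
    try {
        while {[gets $f line] >= 0} {
            if {$line eq $searchLine} {
                return true
            }
        }
        return false
    } finally {
        close $f
    }
}

# Hypothetical driver: copy every line of $aFile that does not occur in $bFile.
proc writeDifference {aFile bFile cFile} {
    set in [open $aFile]
    set out [open $cFile w]
    try {
        while {[gets $in line] >= 0} {
            if {![scanForLine $line $bFile]} {
                puts $out $line
            }
        }
    } finally {
        # Both channels are closed even if the scan raises an error.
        close $in
        close $out
    }
}
```

It would be called as writeDifference A.txt B.txt C.txt.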

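But if one of the files is small enough to fit reasonably in memory, you are much better off reading it into a hash table (e.g., a dictionary or an array):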

set f [open B.txt]
while {[gets $f line] >= 0} {
    set B($line) "any dummy value; we'll ignore it"
}
close $f

set in [open A.txt]
set out [open C.txt w]
while {[gets $in line] >= 0} {
    if {![info exists B($line)]} {
        puts $out $line
    }
}
close $in
close $out

That is much more efficient, but it depends on B.txt being small enough.
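
The same idea can also be written with a dictionary instead of an array. This is only a sketch: the proc name differenceViaDict is hypothetical, and it still assumes that every line of the second file fits in memory.

```tcl
package require Tcl 8.5   ;# dict is built in from 8.5 onwards

# Hypothetical helper: write to $cFile each line of $aFile absent from $bFile.
proc differenceViaDict {aFile bFile cFile} {
    # Load every line of the (smaller) file as a dictionary key.
    set seen [dict create]
    set f [open $bFile]
    while {[gets $f line] >= 0} {
        dict set seen $line 1   ;# the value is a dummy; only the key matters
    }
    close $f

    # Stream the other file and keep only the lines that were not seen.
    set in [open $aFile]
    set out [open $cFile w]
    while {[gets $in line] >= 0} {
        if {![dict exists $seen $line]} {
            puts $out $line
        }
    }
    close $in
    close $out
}
```

An array and a dictionary behave much the same here; the dictionary is an ordinary value, so it can be passed to another procedure without upvar.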

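If both A.txt and B.txt are too large for that, you are probably best off processing in stages, writing intermediate results out to disk in between. This is getting quite a bit more complex!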

set filter [open B.txt]
set fromFile A.txt

for {set tmp 0} {![eof $filter]} {incr tmp} {
    # Filter by a million lines at a time; that'll probably fit OK
    for {set i 0} {$i < 1000000} {incr i} {
        if {[gets $filter line] < 0} break
        set B($line) "dummy"
    }

    # Do the filtering
    if {$tmp} {set fromFile $toFile}
    set from [open $fromFile]
    set to [open [set toFile /tmp/[pid]_$tmp.txt] w]
    while {[gets $from line] >= 0} {
        if {![info exists B($line)]} {
            puts $to $line
        }
    }
    close $from
    close $to

    # Keep control of temporary files and data
    if {$tmp} {file delete $fromFile}
    unset -nocomplain B
}
close $filter
file rename -force -- $toFile C.txt

Warning! I haven't tested this code...
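
Since the question says both files are sorted, a single-pass merge may be another option. This is not from the answer above, just a hedged sketch: the proc name sortedDifference is hypothetical, and it assumes both files are sorted in the order that string compare uses (for example, written out from lsort with its default options). If the files were sorted under a different collation, the result would be wrong.

```tcl
# Hypothetical helper: write to $cFile each line of $aFile absent from $bFile.
# ASSUMPTION: both inputs are sorted in [string compare] order.
proc sortedDifference {aFile bFile cFile} {
    set a [open $aFile]
    set b [open $bFile]
    set out [open $cFile w]

    # haveB is true while lineB holds an unread candidate line from B.
    set haveB [expr {[gets $b lineB] >= 0}]
    while {[gets $a lineA] >= 0} {
        # Skip the lines of B that sort before the current line of A.
        while {$haveB && [string compare $lineB $lineA] < 0} {
            set haveB [expr {[gets $b lineB] >= 0}]
        }
        # Keep the line of A unless B currently holds an identical line.
        if {!$haveB || $lineB ne $lineA} {
            puts $out $lineA
        }
    }

    close $a
    close $b
    close $out
}
```

Each file is read exactly once and only two lines are held in memory at a time, so the file sizes should not matter.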

How do I find the difference between two large files in Tcl?
