Question

I have a 2GB text file and a 500MB text file. The 2GB file is in a slightly obscure format. Here is a sample:

CD 15
IG ABH
NU 1223
**
CD 17
IG RFT
NU 3254
**

where ** is the marker between records.

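I need to extract every NU value whose record has a particular CD value. Then I need to go through the 500MB text file, match its records against the NU values taken from the 2GB file, and write the matching records to a new file.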

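I know PHP. Apart from the size of the files, this would be trivial in PHP. Even reading one line at a time with fgets does not really work: it crashes my computer on localhost (under XAMPP, apache.exe grows until it uses up all system memory). Besides, doing this in PHP would be a pain. The job will be run by a non-technical person, so they would have to download the 2GB and 500MB files from the FTP server when they go live each week, upload them to my FTP server (slow at those file sizes), run a script on my server that takes a long time, and so on.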

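I know some VBScript, but no Perl, no .NET, no C#, etc. How can I write a Windows-based program that runs locally, loads the files one line at a time, and does not fall over because of the file size?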

Answer 1

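The following creates a hash (a kind of associative array) with one (small) element for each NU to look for in the second file. The size of that hash depends on how many matching records there are in the first file.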

If that still takes too much memory, break the first file into smaller pieces, run the program several times, and concatenate the results.

use strict;
use warnings;

my $qfn_idx = '...';
my $qfn_in  = '...';
my $qfn_out = '...';

my $cd_to_match = ...;

my %nus;
{
   open(my $fh_idx, '<', $qfn_idx)
      or die("Can't open \"$qfn_idx\": $!\n");

   local $/ = "\n**\n";
   while (<$fh_idx>) {
      next if !( my ($cd) = /^CD ([0-9]+)/m );
      next if $cd != $cd_to_match;
      next if !( my ($nu) = /^NU ([0-9]+)/m );
      ++$nus{$nu};
   }
}

{
   open(my $fh_in, '<', $qfn_in)
      or die("Can't open \"$qfn_in\": $!\n");
   open(my $fh_out, '>', $qfn_out)
      or die("Can't create \"$qfn_out\": $!\n");

   local $/ = "\n**\n";
   while (<$fh_in>) {
      next if !( my ($nu) = /^NU ([0-9]+)/m );
      next if !$nus{$nu};
      print($fh_out $_);
   }
}
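The fallback mentioned above (splitting the work when the hash is too large) can also be done by batching the NU values rather than physically splitting the first file. The following is only a rough sketch of that idea, written in Python rather than the Perl used in this answer. The names `batches`, `match_in_batches` and `scan_records` are hypothetical, and `scan_records` stands in for "one full pass over the second file".

```python
from itertools import islice


def batches(values, size):
    """Yield successive sets holding at most `size` values each."""
    it = iter(values)
    while True:
        batch = set(islice(it, size))
        if not batch:
            return
        yield batch


def match_in_batches(nu_values, scan_records, batch_size):
    """Yield records whose NU is wanted, keeping only one batch in memory.

    scan_records is a hypothetical callable that returns a fresh iterator
    of (nu, record_text) pairs, i.e. one complete pass over the second file.
    """
    for batch in batches(nu_values, batch_size):
        for nu, text in scan_records():
            if nu in batch:
                yield text


# Tiny in-memory demonstration; real data would be streamed from disk.
records = [("1223", "rec A"), ("3254", "rec B"), ("9999", "rec C")]
result = list(match_in_batches(["3254", "1223"], lambda: iter(records), 1))
```

The trade-off is one pass over the second file per batch, and the output is grouped by batch instead of following the second file's order. If the same NU can fall into two different batches, its records are emitted twice, so the NU values would need deduplicating first.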

Answer 2

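The following declares a VBScript Sub that reads the source file one line at a time and writes the record's NU value to the destination file only when the cdfilter string matches the CD in that record: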

Option Explicit

Const ForReading = 1
Const ForWriting = 2

Sub Extract(srcpath, dstpath, cdfilter)
  Dim fso, src, dst, txt, cd, nu
  Set fso = CreateObject("Scripting.FileSystemObject")
  Set src = fso.OpenTextFile(srcpath, ForReading)
  Set dst = fso.OpenTextFile(dstpath, ForWriting, True)
  While (not src.AtEndOfStream)
    txt = ""
    While (not src.AtEndOfStream) and (txt <> "**")
      txt = src.ReadLine
      If Left(txt, 3) = "CD " Then
        cd = mid(txt, 4)
      End If
      If Left(txt, 3) = "NU " Then
        nu = mid(txt, 4)
      End If
      If txt = "**" Then
        If cd = cdfilter Then
          dst.WriteLine nu
        End If
        ' Reset for every record so a stale CD or NU cannot leak into the next one
        cd = ""
        nu = ""
      End If
    Wend
  Wend
  src.Close
  dst.Close
End Sub

Extract "input.txt", "output.txt", "17"
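The script above covers only the first step: it produces a file with one NU value per line. A possible second step is sketched below in Python (not VBScript), under the assumption that the 500MB file uses the same `**`-delimited layout with an `NU` line per record, which the question does not actually confirm. The function name `filter_by_nu` and the `XX` field in the demo data are made up for illustration.

```python
import io


def filter_by_nu(nu_lines, data_lines, out):
    """Copy to `out` every record of data_lines whose NU appears in nu_lines."""
    # One NU per line, as written by the extraction step; blank lines ignored.
    wanted = {line.strip() for line in nu_lines if line.strip()}
    record, nu = [], None
    for line in data_lines:
        record.append(line)
        text = line.rstrip("\r\n")
        if text.startswith("NU "):
            nu = text[3:]
        elif text == "**":
            # End of record: keep it only if its NU is wanted.
            if nu in wanted:
                out.writelines(record)
            record, nu = [], None
    # Note: a trailing record with no closing "**" is ignored here.


# In-memory demonstration; real use would pass open file objects instead.
nu_list = io.StringIO("3254\n")
data = io.StringIO("NU 1223\nXX a\n**\nNU 3254\nXX b\n**\n")
out = io.StringIO()
filter_by_nu(nu_list, data, out)
result_text = out.getvalue()
```

Only the set of wanted NU values and the current record are held in memory, so the size of the second file should not matter much.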

Answer 3

Basically the same idea as ikegami's, but with a subroutine and some convenient argument handling.

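The basic idea is to read complete records by setting the input record separator $/ to the record delimiter "\n**\n", convert each record into a hash, save the NU values, and use them for lookups later. Note the use of eof to switch modes.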

I hard-coded the CD value, but changing that line to my $CD = shift; would allow you to do the following:

script.pl 15 CD.txt NU.txt > outputfile

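I am not overly fond of using the input record separator, because it is rather inflexible and sensitive to data corruption, such as a missing newline at eof. But as long as the data is consistent, it should not be a problem.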

Usage:

script.pl CD.txt NU.txt > outputfile

CD.txt is the file from which the NU values are extracted; those values are then looked up in NU.txt.

Code:

use strict;
use warnings;

my $CD = 15;
my %NU;
my $read = 1;
local $/ = "\n**\n";
while (<>) {
    next unless /\S/; # no blank lines
    my %check = record($_);
    if ($read) {
        if ($check{'CD'} == $CD) {
            $NU{$check{'NU'}}++;
        }
    } else {
        if ($NU{$check{'NU'}}) {
            print;
        }
    }
    $read &&= !eof;  # stay in read mode until the first file ends
}

sub record {
    my $str = shift;
    chomp $str;  # remove record separator **
    return map(split(/ /, $_, 2), split(/\n/, $str));
}
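The caveat above about the input record separator (a missing newline at eof, or inconsistent line endings) can be sidestepped by reading line by line and treating a line that is exactly `**` as the end of a record. This is only a sketch in Python, not the Perl used in this answer, and the generator name `iter_records` is hypothetical.

```python
def iter_records(lines):
    """Yield each record as a dict of field -> value, reading line by line."""
    fields = {}
    for line in lines:
        text = line.rstrip("\r\n")  # tolerate both LF and CRLF endings
        if text == "**":
            if fields:
                yield fields
            fields = {}
        elif text.strip():
            # Split "NU 1223" into key "NU" and value "1223".
            key, _, value = text.partition(" ")
            fields[key] = value
    if fields:
        # Final record that lacks a closing "**" or trailing newline.
        yield fields


# Mixed line endings and no terminator after the last record.
sample = "CD 15\r\nIG ABH\r\nNU 1223\r\n**\r\nCD 17\nIG RFT\nNU 3254"
records = list(iter_records(sample.splitlines(keepends=True)))
```

Passing an open file object as `lines` keeps memory use to one record at a time, since the file is iterated lazily.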

Parsing a very large text file on Windows
