Question

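I need to merge two files into a new file.

Both have over 300 million pipe-delimited records, with the first column as the primary key. The rows are not sorted. The second file may contain records that the first file does not.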

Sample file 1:

1001234|X15X1211,J,S,12,15,100.05

Sample file 2:

1231112|AJ32,,,18,JP
1001234|AJ15,,,16,PP

Output:

1001234,X15X1211,J,S,12,15,100.05,AJ15,,,16,PP

I am using the following code:

tie %hash_REP, 'Tie::File::AsHash', 'rep.in', split => '\|';
my $counter=0;
while (($key,$val) = each %hash_REP) {
    if($counter==0) {
        print strftime "%a %b %e %H:%M:%S %Y", localtime;
    }
}

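It takes almost 1 hour to prepare the associative array. Is that really good or really bad? Is there a faster way to handle this many records in an associative array? Any suggestion in any scripting language would help.

Thanks, Nitin T.

I also tried the following program, which also took over an hour: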

#!/usr/bin/perl
use POSIX qw(strftime);
my $now_string = strftime "%a %b %e %H:%M:%S %Y", localtime;
print $now_string . "\n";

my %hash;
open(FILE, '<', 'APP.in') or die $!;
while (my $line = <FILE>) {
     chomp($line);
      my($key, $val) = split /\|/, $line;
      $hash{$key} = $val;
 }
 close FILE;

my $filename = 'report.txt';
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";
open(FILE, '<', 'rep.in') or die $!;
while (my $line = <FILE>) {
      chomp($line);
  my @words = split /\|/, $line;
  for (my $i=0; $i <= $#words; $i++) {
    if($i == 0)
    {
       next;
    }
    print $fh  $words[$i] . "|^"
  }
  print $fh  $hash{$words[0]} . "\n";
 }
 close FILE;
 close $fh;
 print "done\n";

$now_string = strftime "%a %b %e %H:%M:%S %Y", localtime;
print $now_string . "\n";

Answer 1

Your technique is extremely inefficient for a couple of reasons.

Tying is extremely slow.
You're pulling everything into memory.

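The first can be alleviated by doing the reading and splitting yourself, but the second will always be a problem. The rule of thumb is to avoid pulling large amounts of data into memory. It will hog all the memory and probably cause the machine to swap to disk and slow waaaay down, especially if you're using a spinning disk.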

Instead, there are various "on-disk hashes" you can use through modules such as GDBM_File or BerkeleyDB.
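
As a rough illustration of what an on-disk hash looks like, here is a minimal sketch. It swaps in SDBM_File, which ships with Perl, rather than the GDBM_File or BerkeleyDB modules named above; the helper name load_disk_hash and the file name are hypothetical. Note that SDBM limits each key plus value to about 1008 bytes, so this shows the idea only and is not a recommendation for 300 million records.

```perl
use strict;
use warnings;
use Fcntl;                      # provides O_RDWR and O_CREAT
use SDBM_File;                  # core module, used here as a stand-in
use File::Temp qw(tempdir);

# Hypothetical helper: store "key|value" lines in an on-disk hash at $path.
# The data lives in files on disk, not in a regular in-memory Perl hash.
sub load_disk_hash {
    my ($path, @lines) = @_;
    tie(my %on_disk, 'SDBM_File', $path, O_RDWR | O_CREAT, 0666)
        or die "Cannot tie $path: $!";
    for my $line (@lines) {
        # Split on the first pipe only: key, then the rest as the value.
        my ($key, $val) = split /\|/, $line, 2;
        $on_disk{$key} = $val;
    }
    return \%on_disk;
}

my $dir  = tempdir(CLEANUP => 1);
my $data = load_disk_hash("$dir/rep",
    '1231112|AJ32,,,18,JP',
    '1001234|AJ15,,,16,PP',
);
print "$data->{1001234}\n";
```

Lookups go through the same hash syntax as before, but every access hits the tie layer, so this trades memory for speed.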

But there's really no reason to mess around with them, because we have SQLite, and it does the job faster and better.

Create a table in SQLite.

create table imported (
    id integer,
    value text
);

Import the file with the sqlite shell's .import command, using .mode and .separator to adjust for your format.

sqlite>     create table imported (
   ...>         id integer,
   ...>         value text
   ...>     );
sqlite> .mode list
sqlite> .separator |
sqlite> .import test.data imported
sqlite> .mode column
sqlite> select * from imported;
12345       NITIN     
12346       NITINfoo  
2398        bar       
9823        baz

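Now you, and anyone else who has to work with the data, can do whatever you like with it using efficient, flexible SQL. Even if the import takes a while, you can go and do something else while it runs.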

Answer 2

I'd use sort to sort the data very quickly (5 seconds for 10,000,000 rows), and then merge the sorted files.

perl -e'
   sub get {
      my $fh = shift;
      my $line = <$fh>;
      return () if !defined($line);

      chomp($line);
      return split(/\|/, $line);
   }

   sub main {
      @ARGV == 2
         or die("usage\n");

      open(my $fh1, "-|", "sort", "-n", "-t", "|", $ARGV[0]);
      open(my $fh2, "-|", "sort", "-n", "-t", "|", $ARGV[1]);

      my ($key1, $val1) = get($fh1)  or return;
      my ($key2, $val2) = get($fh2)  or return;

      while (1) {
         if    ($key1 < $key2) { ($key1, $val1) = get($fh1)  or return; }
         elsif ($key1 > $key2) { ($key2, $val2) = get($fh2)  or return; }
         else {
            print("$key1,$val1,$val2\n");
            ($key1, $val1) = get($fh1)  or return;
            ($key2, $val2) = get($fh2)  or return;
         }
      }
   }

   main();
' file1 file2 >file

With 10,000,000 records in each file, this took 37 seconds on a slowish machine.

$ perl -e'printf "%d|%s\n", 10_000_000-$_, "X15X1211,J,S,12,15,100.05" for 1..10_000_000' >file1

$ perl -e'printf "%d|%s\n", 10_000_000-$_, "AJ15,,,16,PP" for 1..10_000_000' >file2

$ time perl -e'...' file1 file2 >file
real    0m37.030s
user    0m38.261s
sys     0m1.750s
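
To make the merge step of the script above easier to follow in isolation, here is a minimal sketch that applies the same comparison logic to two small in-memory lists that are already sorted numerically by key. The helper name merge_sorted is hypothetical, and the sample lines are taken from the question. As in the script above, keys that appear in only one input are skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: merge-join two lists of "key|value" lines that are
# already sorted numerically by key. Only keys present in both lists
# produce an output line.
sub merge_sorted {
    my ($lines1, $lines2) = @_;
    my ($i, $j) = (0, 0);
    my @out;
    while ($i < @$lines1 && $j < @$lines2) {
        my ($key1, $val1) = split /\|/, $lines1->[$i], 2;
        my ($key2, $val2) = split /\|/, $lines2->[$j], 2;
        if    ($key1 < $key2) { $i++ }      # key only in list 1 so far
        elsif ($key1 > $key2) { $j++ }      # key only in list 2 so far
        else {
            push @out, "$key1,$val1,$val2"; # match: emit the joined record
            $i++;
            $j++;
        }
    }
    return @out;
}

my @file1 = ('1001234|X15X1211,J,S,12,15,100.05');
my @file2 = ('1001234|AJ15,,,16,PP', '1231112|AJ32,,,18,JP');
print "$_\n" for merge_sorted(\@file1, \@file2);
```

Because each input is only ever advanced forward, the merge needs to hold just one record per file at a time, which is why it pairs well with the external sort.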

Alternatively, one could dump the data into a database and let it handle the details.

sqlite3 <<'EOI'
CREATE TABLE file1 ( id INTEGER, value TEXT );
CREATE TABLE file2 ( id INTEGER, value TEXT );
.mode list
.separator |
.import file1 file1
.import file2 file2
.output file
SELECT file1.id || ',' || file1.value || ',' || file2.value
  FROM file1
  JOIN file2
    ON file2.id = file1.id;
.exit
EOI

But you pay a price for the flexibility. This took twice as long.

real    1m14.065s
user    1m11.009s
sys     0m2.550s

Note: I originally had CREATE INDEX file2_id ON file2 ( id ); after the .import commands, but removing it improved performance.
