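How can I split a large text file into evenly sized files without breaking records?

Time: 2011-01-31 15:17:11

Tags: perl unix

I have a large text file (about 10 GB) that contains many stories. Each story begins with the marker $$. Here is a sample of the file: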

$$
AA This is story 1
BB 345

$$

AA This is story 2
BB 456

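I want to split this file into pieces of roughly 250 MB each, but no story may be split across two files.

Can anyone help me with Unix or Perl code for this?

3 answers:

Answer 0: (score: 5)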

use strict;
use warnings;
use autodie;

$/ = "\$\$\n";
my $targetsize = 250*1024*1024;
my $fileprefix = 'chunk';
my $outfile = 0;
my $outfh;
my $outsize = 0;
while (my $story = <>) {
    chomp($story);
    next unless $story; # disregard initial empty chunk
    $story = "$/$story";

    # no file open yet, or the current file is non-empty and this story takes us farther from the target size
    if ( ! $outfile || ( $outsize && abs($outsize - $targetsize) < abs($outsize + length($story) - $targetsize) ) ) {
        ++$outfile;
        open $outfh, '>', "$fileprefix$outfile";
        $outsize = 0;
    }

    $outsize += length($story);
    print $outfh $story;
}
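The condition in the loop above is the core of this approach: a new file is started when appending the next story would leave the current file farther from the target size than closing it as it is. A minimal sketch of that rule in isolation, written in C++; the function name `starts_new_chunk` and its parameters are hypothetical and only illustrate the comparison, they are not part of the answer's script:

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper (not from the answer above): returns true when appending
// a story of story_len bytes would leave the current file farther from target
// than closing the file at current_size bytes.
bool starts_new_chunk(long long current_size, long long story_len, long long target)
{
    assert(target > 0 && current_size >= 0 && story_len >= 0);
    long long distance_without = std::llabs(current_size - target);
    long long distance_with = std::llabs(current_size + story_len - target);
    // On a tie the story is appended, matching the strict "<" in the Perl code.
    return distance_without < distance_with;
}
```

For example, with a target of 100 bytes, a file that already holds 95 bytes and a 20-byte story, closing the file leaves a distance of 5 while appending leaves 15, so the story goes into a new file. This is why the output files can end up slightly above or slightly below the target size.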

Answer 1: (score: 1)

csplit is what you want. It does the same job as split, but it splits on a pattern rather than on size.

An alternative in C++ (untested):

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Close the current output file (if any) and open the next one: prefix_0, prefix_1, ...
void new_output_file(std::ofstream &out, const char *prefix)
{
    static int i = 0;
    std::ostringstream filename;
    filename << prefix << "_" << i++;
    if (out.is_open())
        out.close();
    out.clear(); // reset state flags before reusing the stream
    out.open(filename.str().c_str());
}

int main(int argc, char **argv)
{
    if (argc < 3)
    {
        std::cerr << "usage: " << argv[0] << " INPUT OUTPUT_PREFIX\n";
        return EXIT_FAILURE;
    }
    std::ifstream in(argv[1]);
    long size = 0;
    const long max_size = 200 * 1024 * 1024;
    std::string line;
    std::ofstream out;
    new_output_file(out, argv[2]);
    // Loop on getline itself so that a failed read at end of file is not processed.
    while (std::getline(in, line))
    {
        size += line.length() + 1 /* line termination char */;
        // Switch files only at a story marker, and only once the size limit is reached.
        if (size >= max_size && line.compare(0, 2, "$$") == 0)
        {
            new_output_file(out, argv[2]);
            size = line.length() + 1;
        }
        out << line << '\n';
    }
    return 0;
}

Answer 2: (score: 1)

I modified ysth's code and found that it works. If you see a way to improve it, please suggest changes.

use strict;
use warnings;

my $targetsize = 50*1024*1024;
my $fileprefix = 'chunk';
my $outfile = 0;
my $outsize = 0;
my $outfh;
while (my $line = <>) {
    chomp($line);
    # skip blank lines; a plain truth test would also drop a line containing only "0"
    next unless length $line;
    # switch files only at a story marker, once the current file has passed the target size
    if ( $outfile == 0 || ($line =~ /^\$\$$/ && $outsize > $targetsize) ) {
        ++$outfile;
        close($outfh) if $outfh;
        open $outfh, '>', "$fileprefix$outfile" or die "Cannot open $fileprefix$outfile: $!";
        $outsize = 0;
    }
    $outsize += length($line) + 1;   # + 1 for the newline written below
    print $outfh "$line\n";
}
close($outfh) if $outfh;