I have a large text file (about 10 GB) containing a large number of stories. Each story begins with the marker $$. Here is a sample of the file:
$$
AA This is story 1
BB 345
$$
AA This is story 2
BB 456
I want to split this file into pieces of about 250 MB each, but no story should be split across two files.
Can anyone help me with Unix or Perl code for this?
Answer 0 (score: 5)
    use strict;
    use warnings;
    use autodie;

    # Read one story per record: each record ends at a "$$" marker line.
    $/ = "\$\$\n";

    my $targetsize = 250*1024*1024;
    my $fileprefix = 'chunk';
    my $outfile = 0;
    my $outfh;
    my $outsize = 0;

    while (my $story = <>) {
        chomp($story);                # strip the trailing "$$\n" separator
        next unless length $story;    # disregard initial empty chunk
        $story = "$/$story";          # put the marker back at the start of the story
        # no file open yet, or this story takes us farther from the target size
        if ( ! $outfile || abs($outsize - $targetsize) < abs($outsize + length($story) - $targetsize) ) {
            ++$outfile;
            open $outfh, '>', "$fileprefix$outfile";
            $outsize = 0;
        }
        $outsize += length($story);
        print $outfh $story;
    }
Answer 1 (score: 1)
csplit is what you want. It does the same job as split, but it splits on a pattern rather than on a fixed size.
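As a rough illustration of the csplit suggestion, the sketch below splits a small sample at every $$ line. It assumes GNU csplit (the -z option and the {*} repeat count are GNU extensions), and the file name stories.txt and the prefix story_ are made up for the example. Note that this produces one file per story rather than 250 MB chunks, so the pieces would still need to be grouped by size afterwards.

```shell
# Work in a scratch directory so the example leaves nothing behind elsewhere.
cd "$(mktemp -d)"

# Tiny stand-in for the real 10 GB file (hypothetical name: stories.txt).
printf '%s\n' '$$' 'AA This is story 1' 'BB 345' \
              '$$' 'AA This is story 2' 'BB 456' > stories.txt

# Split before every line that consists solely of "$$".
#   -s      : do not print the byte count of each piece
#   -z      : drop the empty piece that precedes the first "$$" (GNU)
#   -f / -n : output names are story_ followed by 6 digits
#   '{*}'   : repeat the pattern as many times as possible (GNU)
csplit -s -z -f story_ -n 6 stories.txt '/^\$\$$/' '{*}'

ls story_*
```

The 6-digit suffix is there because the default of 2 digits only allows 100 output files, which a file with many stories would quickly exceed.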
An alternative in C++ (untested):
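    #include <boost/shared_ptr.hpp>
    #include <sstream>
    #include <iostream>
    #include <fstream>
    #include <string>

    // Open the next numbered output file: <prefix>_0, <prefix>_1, ...
    void new_output_file(boost::shared_ptr<std::ofstream> &out, const char *prefix)
    {
        static int i = 0;
        std::ostringstream filename;
        filename << prefix << "_" << i++;
        // std::ofstream needs a string, not the stream object itself.
        out.reset(new std::ofstream(filename.str().c_str()));
    }

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            std::cerr << "usage: " << argv[0] << " <input> <output-prefix>\n";
            return 1;
        }
        std::ifstream in(argv[1]);
        long size = 0;
        const long max_size = 200 * 1024 * 1024;
        std::string line;
        boost::shared_ptr<std::ofstream> out;
        new_output_file(out, argv[2]);
        // Test getline itself so no spurious empty line is written at end of input.
        while (std::getline(in, line))
        {
            size += line.length() + 1 /* line termination char */;
            // Only roll over on a "$$" marker line, so no story is split.
            if (size >= max_size && line.length() >= 2 && line[0] == '$' && line[1] == '$')
            {
                new_output_file(out, argv[2]);
                size = line.length() + 1;
            }
            // Dereference the pointer to write; '\n' avoids flushing on every line.
            *out << line << '\n';
        }
        return 0;
    }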
Answer 2 (score: 1)
I modified ysth's code and found that it works. If you see a way to make it better, please suggest changes.
    use strict;
    use warnings;
    use autodie;

    my $targetsize = 50*1024*1024;
    my $fileprefix = 'chunk';
    my $outfile = 0;
    my $outsize = 0;
    my $outfh;

    while (my $line = <>) {
        # Switch files only at a "$$" marker (or on the very first line),
        # so a story is never split across two files.
        if ($line =~ /^\$\$$/ || $outfile == 0) {
            if ($outfile == 0 || $outsize > $targetsize) {
                ++$outfile;
                close($outfh) if $outfh;
                open $outfh, '>', "$fileprefix$outfile";
                $outsize = 0;
            }
        }
        # Count bytes as they are written (newline included) and keep blank
        # lines instead of skipping them, so the output matches the input.
        $outsize += length($line);
        print $outfh $line;
    }