Answer 0 (score: 2)
Create a file containing 1 million lines:
perl -e 'for (1..1000000) { print "line $_ - and some data_$_\n" }' > large_file
Here is a Perl script for sampling a large file:
<strong>sample_size.pl</strong>
#!/usr/bin/env perl
use warnings;
use strict;
my ($filename, $n) = @ARGV;
$filename
    or die "usage: $0 filename sample_size";
-f $filename
    or die "Invalid filename '$filename'";
# Quote the file name so that names containing spaces survive the shell
chomp(my ($word_count_lines) = `/usr/bin/wc -l '$filename'`);
# split ' ' skips the leading whitespace that some wc implementations emit
my ($lines) = split ' ', $word_count_lines;
die "Need to pass in sample size"
    unless $n;
my $sample_size = int $n;
die "Invalid sample size '$n', must be greater than 0 and less than $lines"
    unless (0 < $sample_size and $sample_size < $lines);
# Pick some random line numbers
my %sample;
while ( keys %sample < $sample_size ) {
    $sample{ 1 + int rand $lines }++;
}
open my $fh, '<', $filename
    or die "Unable to open '$filename' for reading : $!";
open my $fh_sample, '>', "$filename.sample"
    or die "Unable to open '$filename.sample' for writing : $!";
open my $fh_remainder, '>', "$filename.remainder"
    or die "Unable to open '$filename.remainder' for writing : $!";
my $current_fh;
while (<$fh>) {
    my $line_number = $.;
    $current_fh = $sample{ $line_number } ? $fh_sample : $fh_remainder;
    # Write to correct file
    print $current_fh $_;
}
close $fh
    or die "Unable to finish reading '$filename' : $!";
close $fh_sample
    or die "Unable to finish writing '$filename.sample' : $!";
close $fh_remainder
    or die "Unable to finish writing '$filename.remainder' : $!";
print "Original file '$filename' has $lines rows\n";
print "Created '$filename.sample' with $sample_size rows\n";
print "Created '$filename.remainder' with " . ($lines - $sample_size) . " rows\n";
print "Run 'mv $filename.remainder $filename' if you are happy with this result\n";
Run the script:
$ perl ./sample_size.pl large_file 10
<strong>Output</strong>
Original file 'large_file' has 1000000 rows
Created 'large_file.sample' with 10 rows
Created 'large_file.remainder' with 999990 rows
Run 'mv large_file.remainder large_file' if you are happy with this result
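The "Pick some random line numbers" step in the script can be tried on its own. The following is a minimal sketch, not part of the original answer: pick_line_numbers is a hypothetical helper name, and the values 1_000_000 and 10 simply mirror the example run above. It shows why a hash is used: duplicate draws collapse into a single key, so the loop keeps drawing until enough distinct line numbers exist.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not in the original script): returns a
# hash ref whose keys are $sample_size distinct line numbers in 1..$lines.
# The caller must ensure $sample_size <= $lines, or the loop never ends.
sub pick_line_numbers {
    my ($lines, $sample_size) = @_;
    my %picked;
    # rand($lines) is a float in [0, $lines); int() truncates it to
    # 0..$lines-1, so adding 1 yields a line number in 1..$lines.
    # A repeated draw only increments an existing key, so the key count
    # grows only when a new line number is drawn.
    while ( scalar(keys %picked) < $sample_size ) {
        $picked{ 1 + int(rand($lines)) }++;
    }
    return \%picked;
}

my $picked = pick_line_numbers(1_000_000, 10);
print join(', ', sort { $a <=> $b } keys %$picked), "\n";
```

When the sample size is small relative to the file, as in the run above, repeated draws are rare and the loop finishes quickly. As the sample size approaches the number of lines, most draws are repeats and the loop slows down noticeably.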