Select n random lines from a text file and paste them into a new file

Date: 2016-05-27 08:14:55

Tags: linux command-line

I have a large text file. I want to randomly select n lines, remove them from the original file, and put them into a new file.

A solution is given here, but it does not delete the selected lines from the original file.

Thanks

1 Answer:

Answer 0: (Score: 2)

Create a file with 1 million lines:

perl -e 'for (1..1000000) { print "line $_ - and some data_$_\n" }' > large_file

Here is a Perl script that samples a large file, splitting it into a sample and a remainder:

sample_size.pl

#!/usr/bin/env perl

use warnings;
use strict;

my ($filename, $n) = @ARGV;
$filename
    or die "usage: $0 filename sample_size";

-f $filename
    or die "Invalid filename '$filename'";
chomp(my ($word_count_lines) = `/usr/bin/wc -l $filename`);
my ($lines, undef) = split /\s+/, $word_count_lines;

die "Need to pass in sample size"
    unless $n;
my $sample_size = int $n;

die "Invalid sample size '$n', should be in the range [ 0 - $lines ]"
    unless (0 < $sample_size and $sample_size < $lines);

# Pick some random line numbers
my %sample;
while ( keys %sample < $sample_size ) {
    $sample{ 1+int rand $lines }++;
}

open my $fh, '<', $filename
    or die "Unable to open '$filename' for reading : $!";

open my $fh_sample, '>', "$filename.sample"
    or die "Unable to open '$filename.sample' for writing : $!";
open my $fh_remainder, '>', "$filename.remainder"
    or die "Unable to open '$filename.remainder' for writing : $!";

my $current_fh;
while (<$fh>) {
    my $line_number = $.;
    $current_fh = $sample{ $line_number } ? $fh_sample : $fh_remainder;
    # Write to correct file
    print $current_fh $_;
}
close $fh
    or die "Unable to finish reading '$filename' : $!";
close $fh_sample
    or die "Unable to finish writing '$filename.sample' : $!";
close $fh_remainder
    or die "Unable to finish writing '$filename.remainder' : $!";

print "Original file '$filename' has $lines rows\n";
print "Created '$filename.sample' with $sample_size rows\n";
print "Created '$filename.remainder' with " . ($lines - $sample_size) . " rows\n";
print "Run 'mv $filename.remainder $filename' if you are happy with this result\n";

Run the script:

$ perl ./sample_size.pl large_file 10

Output

Original file 'large_file' has 1000000 rows
Created 'large_file.sample' with 10 rows
Created 'large_file.remainder' with 999990 rows
Run 'mv large_file.remainder large_file' if you are happy with this result
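For comparison, the same sample/remainder split can be sketched with standard command-line tools. This is a minimal sketch, assuming GNU coreutils `shuf` and a POSIX `awk` are available; the file names `demo_file`, `picked_lines`, and the `.sample`/`.remainder` suffixes are illustrative, not part of the original answer:

```shell
# Create a small demo file (stands in for large_file above)
seq 1 1000 | sed 's/^/line /' > demo_file

n=10
# Pick n distinct random line numbers, like the %sample hash in the Perl script
shuf -i 1-"$(wc -l < demo_file)" -n "$n" | sort -n > picked_lines

# Route each line either to the sample or to the remainder
awk 'NR==FNR { pick[$1]; next }
     FNR in pick { print > "demo_file.sample"; next }
     { print > "demo_file.remainder" }' picked_lines demo_file

wc -l demo_file.sample demo_file.remainder
```

As with the Perl version, you would finish with `mv demo_file.remainder demo_file` once you are happy with the split.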