Question

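I'm using a visualization tool to look at atom probe data.

My output files contain 4 columns. Each line holds an atom's x, y, and z coordinates plus an intensity value that identifies which atom it is. There is one output file for each element in the system.

I have code that counts the lines in each output file and divides by the total to work out the composition of the system. For example, if the line counts of all output files sum to 100 and my iron output file contains 85 lines, then 85% of the system is iron.

Now I'd like to reduce the number of iron atoms so that the other atoms are easier to see. How can I randomly delete 90% of the lines from the output file? I want to do something conditional, like: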

if ($atom>80) {      #such as iron being 85
    #randomly remove lines, perhaps with rand()
}

Answer 1

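The rand function returns a real value in the interval [0, 1). If we want a condition that is true 90% of the time, we can write rand() < 0.9. Since you want to keep only 10% of the iron atoms: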

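my $percentage = shift @ARGV;   # fraction of iron records to drop, e.g. 0.9
while (<>) {
  # this_record_is_iron() stands in for your own test on the current line
  print unless this_record_is_iron() && rand() < $percentage;
}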

Then:

$ perl reduce_iron.pl 0.9 input-data >reduced-data
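
Note that this filter drops each iron record independently, so the surviving count is only about 10% and changes from run to run. A small sketch to see the effect, where the helper name count_survivors and the record count of 1000 are made up for illustration:

```perl
use strict;
use warnings;

# Count how many of $n records survive when each one is dropped
# independently with probability $fraction.
sub count_survivors {
    my ($n, $fraction) = @_;
    return scalar grep { !(rand() < $fraction) } 1 .. $n;
}

# Three independent runs; the counts hover around 100 but usually differ.
printf "run %d: %d of 1000 kept\n", $_, count_survivors(1000, 0.9) for 1 .. 3;
```

If the visualization only needs a thinner cloud of iron atoms, this variation probably does not matter.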

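If we want to remove exactly 90% rather than roughly 90%, then I'd read in the whole file, build an array of indices that point to the iron records, shuffle that index list, and delete the records at the first 90% of the shuffled indices, which keeps the remaining 10%: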

use List::Util qw/shuffle/;
my $percentage = shift @ARGV;
my(@lines, @iron_idx);
while (<>) {
  push @lines, $_;
  push @iron_idx, $#lines if this_record_is_iron();
}
@iron_idx = (shuffle @iron_idx)[0 .. @iron_idx * $percentage - 1]; # keep indices to delete
$_ = "" for @lines[@iron_idx];
print @lines;
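
Since the question says each element has its own output file, every line of the iron file is an iron record, so the placeholder predicate may not be needed at all. A minimal sketch under that assumption; the name drop_fraction and the fake records are illustrative only:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: return a reference to a copy of @$lines with exactly
# int(@$lines * $fraction) entries removed at random, original order kept.
sub drop_fraction {
    my ($lines, $fraction) = @_;
    my $n_drop = int(@$lines * $fraction);
    # Shuffle all indices and mark the first $n_drop of them for deletion.
    my @shuffled = shuffle 0 .. $#{$lines};
    my %drop = map { $_ => 1 } @shuffled[0 .. $n_drop - 1];
    return [ map { $lines->[$_] } grep { !$drop{$_} } 0 .. $#{$lines} ];
}

# Example with 20 fake "x y z intensity" records.
my @records = map { "$_ 0 0 56\n" } 1 .. 20;
my $kept = drop_fraction(\@records, 0.9);
print scalar(@$kept), " of ", scalar(@records), " lines kept\n";
```

Like the code above, this holds the whole file in memory, which should be fine unless the file is very large.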

Answer 2

A more elaborate implementation that uses reservoir sampling:

#! /usr/bin/env perl

use strict;
use warnings;

use Fcntl qw/ SEEK_SET /;

die "Usage: $0 fraction file\n" .
    "  where 1 <= fraction <= 99\n"
  unless @ARGV == 2;

my($fraction,$path) = @ARGV;
die "$0: invalid fraction: $fraction"
  unless $fraction =~ /^[0-9]+$/ && $fraction >= 1 && $fraction <= 99;

open my $fh, "<", $path or die "$0: open $path: $!";
my $lines = 0;
++$lines while defined($_ = <$fh>);

# modified Algorithm R from Knuth's TAoCP Volume 2, pg. 144
# @delete is a reservoir of $samples line numbers chosen uniformly at random
my $samples = int (($lines / 100) * $fraction);
my @delete = (1 .. $samples);
foreach my $t ($samples+1 .. $lines) {
  my $m = int(rand $t);                 # uniform in 0 .. $t-1
  $delete[$m] = $t if $m < $samples;    # line $t replaces a random slot
}

seek $fh, 0, SEEK_SET or die "$0: seek: $!";
my %delete = map +($_ => 1), @delete;
$. = 0;  # seek does not reset the line counter; the first read sets it to 1
while (<$fh>) {
  print unless delete $delete{$.};
}
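
The selection step can be isolated and checked on its own. A minimal sketch of classic reservoir sampling (Algorithm R), assuming $k is no larger than $total; the name pick_line_numbers is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: choose $k distinct line numbers from 1 .. $total,
# every subset of size $k being equally likely. Assumes $k <= $total.
sub pick_line_numbers {
    my ($total, $k) = @_;
    my @reservoir = (1 .. $k);            # start with the first $k lines
    for my $t ($k + 1 .. $total) {
        my $m = int rand $t;              # uniform in 0 .. $t-1
        $reservoir[$m] = $t if $m < $k;   # happens with probability $k/$t
    }
    return @reservoir;
}

my @chosen = sort { $a <=> $b } pick_line_numbers(20, 5);
print "delete lines: @chosen\n";
```

Only the reservoir of line numbers lives in memory here, which is why the script reads the file twice instead of slurping it.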

How do I randomly delete a fraction of the lines in a file?
