
Question

How can I get n random lines from a very large file?

It would also be great if I could add filters before or after the randomization.

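Update 1

In my case, the specs are:

> 100 million lines
> 10 GB file
Usual random batch size: 10000-30000
Hosted Ubuntu Server 14.10 with 512 MB RAM

So losing a few lines from the file won't be a big problem, since they have a 1/100 chance anyway, but performance and resource consumption would be a problem.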

Answer 1

Here's a wee bash function for you. It grabs, as you say, a "batch" of lines, with a random start point within the file.

randline() {
  local lines c r _

  # cache the number of lines in this file in a symlink in the temp dir
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l "$1")
    ln -sfn "$c" "$lines"
  fi

  # Pick a random number...
  r=$(( c * (RANDOM * 32768 + RANDOM) / (32768 * 32768) ))
  echo "start=$r" >&2

  # And start displaying $2 lines before that number.
  head -n "$r" "$1" | tail -n "${2:-1}"
}

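Edit the echo line as required.

This solution has the benefit of fewer pipes, less resource-intensive pipes (i.e. no | sort ... |), and less platform dependence (i.e. no sort -R, which is specific to GNU sort).

Note that this relies on Bash's $RANDOM variable, which may or may not actually be random. Also, it will miss lines if your source file contains more than 32768^2 lines, and there's a failure edge case if the number of lines you've specified (N) is >1 and the random start point is less than N lines from the beginning. Solving that is left as an exercise for the reader. :)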
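For what it's worth, here's a minimal sketch of one way to tackle both caveats. The names rand_below and random_end are made up for illustration; they are not part of the function above. The idea is to glue three 15-bit $RANDOM draws into a 45-bit value (enough for files far beyond 32768^2 lines), and to pick the head -n argument from the range N..total so that tail always has N lines to show. The modulo introduces a slight bias, which is negligible as long as the line count is tiny compared to 2^45.

```bash
# rand_below N: print a pseudo-random integer in [0, N).
# Hypothetical helper: three 15-bit $RANDOM draws form one 45-bit value.
rand_below() {
  local n=$1
  local r=$(( (RANDOM << 30) | (RANDOM << 15) | RANDOM ))
  echo $(( r % n ))
}

# random_end TOTAL BATCH: print an end line in [BATCH, TOTAL], so that
# `head -n END file | tail -n BATCH` always yields BATCH lines.
# Assumes TOTAL >= BATCH.
random_end() {
  local total=$1 batch=$2
  echo $(( $(rand_below $(( total - batch + 1 ))) + batch ))
}
```

You'd then use it along the lines of r=$(random_end "$c" "$2") in place of the arithmetic in the function.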

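Update #1:

mklement0 asks an excellent question in the comments about potential performance issues with the head ... | tail ... approach. I honestly don't know the answer, but I would hope that both head and tail are optimized sufficiently that they wouldn't buffer ALL input before displaying their output.

On the off chance that my hope is unfulfilled, here's an alternative. It's an awk-based "sliding window" tail. I'll embed it in the earlier function I wrote so that you can test it if you want.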

randline() {
  local lines c r _

  # Line count cache, per the first version of this function...
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l "$1")
    ln -sfn "$c" "$lines"
  fi

  r=$(( c * (RANDOM * 32768 + RANDOM) / (32768 * 32768) ))

  echo "start=$r" >&2

  # This simply pipes the functionality of the `head | tail` combo above
  # through a single invocation of awk.
  # It should handle any size of input file with the same load/impact.
  awk -v lines="${2:-1}" -v count=0 -v start="$r" '
    NR < start { next; }
    { out[NR]=$0; count++; }
    count > lines { delete out[start++]; count--; }
    END {
      for(i=start;i<start+lines;i++) {
        print out[i];
      }
    }
  ' "$1"
}

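The embedded awk script replaces the head ... | tail ... pipeline in the previous version of the function. It works as follows:

It skips lines until the "start" determined by the earlier randomization.
It records the current line to an array.
If the array is larger than the number of lines we want to keep, it eliminates the first record.
At the end of the file, it prints the recorded data.

The result is that the awk process shouldn't grow its memory footprint, because the output array gets trimmed as fast as it's built.

NOTE: I haven't actually tested this with your data.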
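If all you want is the N lines beginning at the random line, nothing needs to be buffered at all, and awk can stop reading as soon as the block is done. Here's a rough sketch of that variant; print_block is a hypothetical name, and this is an alternative to the sliding window rather than a drop-in copy of it.

```bash
# print_block FILE START COUNT (hypothetical helper):
# print COUNT lines beginning at line START, then stop reading the file.
print_block() {
  awk -v start="$2" -v count="$3" '
    NR >= start + count { exit }   # past the block: quit without reading the rest
    NR >= start                    # inside the block: print the line
  ' "$1"
}
```

For example, print_block BIGFILE "$r" 10000 would print 10000 lines starting at line $r. Lines before $r are still read and discarded, so a start point near the end of a 10 GB file remains the slow case.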

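Update #2:

Hrm, with the update to your question saying that you want N random lines rather than a block of lines starting at a random point, we need a different strategy. The system limitations you've imposed are pretty severe. The following might be an option, also using awk, with the random numbers still coming from Bash: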

randlines() {
  local lines c r _

  # Line count cache...
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l "$1")
    ln -sfn "$c" "$lines"
  fi

  # Create a LIST of random numbers, from 1 to the size of the file ($c)
  for (( i=0; i<$2; i++ )); do
    echo $(( c * (RANDOM * 32768 + RANDOM) / (32768 * 32768) + 1 ))
  done | awk '
    # And here inside awk, build an array of those random numbers, and
    NR==FNR { lines[$1]; next; }
    # display lines from the input file that match the numbers.
    FNR in lines
  ' - "$1"
}

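This works by feeding a list of random line numbers into awk as a "first" file, and then having awk print lines from the "second" file whose line numbers were included in the "first" file. It uses wc to determine the upper limit of random numbers to generate. That means you'll be reading this file twice. If you have another source for the number of lines in the file (a database, for example), plug it in here. :)

A limiting factor might be the size of that first file, which must be loaded into memory. I believe that 30000 random numbers should only take about 170KB of memory, but how the array gets represented in RAM depends on the implementation of awk you're using. (Though usually, awk implementations (including Gawk in Ubuntu) are pretty good at keeping memory wastage to a minimum.)

Does this work for you?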
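If reading the file twice is the sticking point, a different technique, reservoir sampling, gets N uniformly chosen lines in a single pass without knowing the line count up front. This is a sketch under the assumption that N lines fit comfortably in memory (30000 lines should, unless they are huge); reservoir_sample is a hypothetical name, and the random numbers here come from awk's rand() rather than from Bash.

```bash
# reservoir_sample N [FILE...] (hypothetical helper):
# one pass over the input, at most N lines held in memory.
reservoir_sample() {
  local n=$1; shift
  awk -v n="$n" '
    BEGIN { srand() }
    NR <= n { keep[NR] = $0; next }     # fill the reservoir with the first n lines
    {
      j = int(rand() * NR) + 1          # uniform integer in 1..NR
      if (j <= n) keep[j] = $0          # replace a slot with probability n/NR
    }
    END {
      m = (NR < n) ? NR : n
      for (i = 1; i <= m; i++) print keep[i]
    }
  ' "$@"
}
```

Usage would be something like reservoir_sample 30000 BIGFILE, or filter_before | reservoir_sample 30000 | filter_after. Note that srand() seeds from the clock, so two runs within the same second will pick the same lines.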

Answer 2

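Given such constraints, the following approach will work better.

Seek to a random position in the file (e.g. you will land "inside" some line)
Go backward from this position and find the start of the given line
Go forward and print the full line

For this you need a tool that can seek in files, for example perl.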

use strict;
use warnings;
use Symbol;
use Fcntl qw( :seek O_RDONLY ) ;
my $seekdiff = 256; #e.g. from "rand_position-256" up to rand_position+256

my($want, $filename) = @ARGV;

my $fd = gensym ;
sysopen($fd, $filename, O_RDONLY ) || die("Can't open $filename: $!");
binmode $fd;
my $endpos = sysseek( $fd, 0, SEEK_END ) or die("Can't seek: $!");

my $buffer;
my $cnt;
while($want > $cnt++) {
    my $randpos = int(rand($endpos));   #random file position
    my $seekpos = $randpos - $seekdiff; #start read here ($seekdiff chars before)
    $seekpos = 0 if( $seekpos < 0 );

    sysseek($fd, $seekpos, SEEK_SET);   #seek to position
    my $in_count = sysread($fd, $buffer, $seekdiff<<1); #read 2*seekdiff characters

    my $rand_in_buff = ($randpos - $seekpos)-1; #the random position in the buffer

    my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1; #find the beginning of the line in the buffer
    my $lineend = index $buffer, "\n", $linestart;            #find the end of line in the buffer
    my $the_line = substr $buffer, $linestart, $lineend < 0 ? 0 : $lineend-$linestart;

    print "$the_line\n";
}

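Save the above into some file, such as "randlines.pl", and use it as:

perl randlines.pl wanted_count_of_lines file_name

e.g.

perl randlines.pl 10000 ./BIGFILE

The script does very low-level IO operations, i.e. it is VERY FAST. (On my notebook, selecting 30k lines from 10M took half a second.)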
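The same seek idea can be roughed out in plain shell, with one difference from the perl script: instead of walking backward to the start of the line it lands in, this sketch throws away the rest of that line and prints the next one. random_line_by_seek is a hypothetical name. It spawns a couple of processes per line, so it will be far slower than the perl version for 30k lines; it's here to show the mechanics.

```bash
# random_line_by_seek FILE (hypothetical helper):
# jump to a random byte offset, skip the remainder of the line we landed in,
# and print the next complete line. Prints nothing when the offset falls in
# the last line, so the caller should simply try again in that case.
random_line_by_seek() {
  local file=$1 size pos
  size=$(wc -c < "$file") || return
  [ "$size" -gt 0 ] || return 1
  pos=$(( ((RANDOM << 30) | (RANDOM << 15) | RANDOM) % size ))
  if [ "$pos" -eq 0 ]; then
    head -n 1 "$file"        # offset 0 is already the start of a line
  else
    # tail -c +K starts output at byte K (1-based); sed prints only line 2
    tail -c +"$(( pos + 1 ))" "$file" | sed -n '2{p;q;}'
  fi
}
```

As with any pick-a-random-byte scheme, the choice is not uniform over lines: here a line is more likely to be printed when the line before it is long. For the "losing a few lines is fine" use case that may well be acceptable.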

Answer 3

Simple (but slow) solution

n=15 #number of random lines
filter_before | sort -R | head -$n | filter_after

#or, if you could have duplicate lines
filter_before | nl | sort -R | cut -f2- | head -$n | filter_after
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

or, if you want, save the following into a randlines script

#!/bin/bash
nl | sort -R | cut -f2- | head -"${1:-10}"

and use it as:

filter_before | randlines 55 | filter_after   #for 55 lines

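How it works:

sort -R sorts the file by a random hash computed for each line, so you get a randomized order of lines, and the first N lines are therefore random lines.

Because hashing produces the same hash for identical lines, duplicate lines are not treated as different. The duplicates can be made distinct by adding line numbers (with nl), so sort never sees an exact duplicate. After the sort, the added line numbers are removed.

Example: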

seq -f 'some line %g' 500 | nl | sort -R | cut -f2- | head -3

prints in subsequent runs:

some line 65
some line 420
some line 290

some line 470
some line 226
some line 132

some line 433
some line 424
some line 196

Demo with duplicate lines:

yes $'one\ntwo' | head -10 | nl | sort -R | cut -f2- | head -3

prints in subsequent runs:

one
two
two

one
two
one

one
one
two

Finally, if you want to use sed instead of cut:

sed -r 's/^\s*[0-9][0-9]*\t//'
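To see the duplicate-grouping behaviour described above for yourself, here's a small sketch with two hypothetical wrappers, shuffle_grouped and shuffle_distinct. One detail that is my addition, not part of the commands above: nl by default numbers only non-empty lines, so the second wrapper uses nl -ba to number every line and keep blank lines intact.

```bash
# Both helpers (hypothetical names) read stdin and write a shuffled copy to stdout.

# Plain sort -R: identical lines get the same hash, so they end up adjacent.
shuffle_grouped() { sort -R; }

# Number every line first (-ba), shuffle, then strip the number again.
# Every key is now unique, so duplicates are scattered like any other line.
shuffle_distinct() { nl -ba | sort -R | cut -f2-; }
```

Feeding printf '%s\n' a b a b a b through shuffle_grouped | uniq always leaves exactly two lines, while shuffle_distinct returns all six in a mixed order.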

Answer 4

#!/bin/bash
#contents of bashScript.sh

file="$1";
lineCnt=$2;
filter="$3";
nfilter="$4";
echo "getting $lineCnt lines from $file matching '$filter' and not matching '$nfilter'" 1>&2;

totalLineCnt=$(cat "$file" | grep "$filter" | grep -v "$nfilter" | wc -l | grep -o '^[0-9]\+');
echo "filtered count : $totalLineCnt" 1>&2;

chances=$( echo "$lineCnt/$totalLineCnt" | bc -l );
echo "chances : $chances" 1>&2;

cat "$file" | awk -v chances="$chances" 'BEGIN { srand() } rand() <= chances { print; }' | grep "$filter" | grep -v "$nfilter" | head -"$lineCnt";

Usage:

Get 1000 random samples

bashScript.sh /path/to/largefile.txt 1000

Lines that contain digits

bashScript.sh /path/to/largefile.txt 1000 "[0-9]"

no mike and jane

bashScript.sh /path/to/largefile.txt 1000 "[0-9]" "mike|jane"
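A per-line coin flip like the one above yields roughly the requested number of lines, sometimes fewer. Here's a sketch of a variant that returns exactly that many, using a different technique (selection sampling: keep each matching line with probability "still wanted / still remaining"). The function name sample_filtered is hypothetical. Two assumptions differ from the script above: the filters are awk extended regular expressions, so "mike|jane" works as an alternation, and an omitted filter means "no filter".

```bash
# sample_filtered FILE COUNT [INCLUDE_REGEX] [EXCLUDE_REGEX]  (hypothetical helper)
# Note: awk -v interprets backslash escapes in the regex strings.
sample_filtered() {
  local file=$1 want=$2 inc=${3:-} exc=${4:-} total
  # Pass 1: count the lines that survive both filters.
  total=$(awk -v inc="$inc" -v exc="$exc" '
    (inc == "" || $0 ~ inc) && (exc == "" || $0 !~ exc) { n++ }
    END { print n + 0 }' "$file")
  # Pass 2: selection sampling over the surviving lines.
  awk -v inc="$inc" -v exc="$exc" -v want="$want" -v left="$total" '
    BEGIN { srand() }
    (inc == "" || $0 ~ inc) && (exc == "" || $0 !~ exc) {
      if (rand() * left < want) { print; want-- }   # keep with probability want/left
      left--
      if (want == 0) exit                           # quota filled, stop reading
    }' "$file"
}
```

For example: sample_filtered /path/to/largefile.txt 1000 "[0-9]" "mike|jane". It still reads the file twice, like the original script, and the output keeps the file's line order.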

Answer 5

I've used rl for line randomisation and found it to perform quite well. Not sure how it scales to your case (you'd simply do rl FILE | head -n NUM). You can get it here: http://arthurdejong.org/rl/

Get random lines from a large file in bash
