Question

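I have a huge file (millions of lines) and want to take a random sample from it. I have already generated a list of unique random numbers, and now I want to get every line whose line number matches one of those numbers.

Sorting the random numbers is not a problem, so I was thinking I could take the differences between consecutive numbers and simply advance the cursor through the file by each difference.

I think I should use sed or awk.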

Answer 1

Why not use shuf directly to get random lines:

shuf -n NUMBER_OF_LINES file

$ seq 100 >a   # the file "a" contains the numbers 1 to 100, one per line

$ shuf -n 4 a
54
46
30
53

$ shuf -n 4 a
50
37
63
21

Update

Can I somehow store the line numbers that shuf selected? - Pio

shuf -i 1-"$(wc -l < a)" -n 5 > rand_numbers # store the list of line numbers (the range must not exceed the number of lines in "a")
awk 'FNR==NR {a[$1]; next} {if (FNR in a) print}' rand_numbers a # print those lines
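For a file with millions of lines it may be worth stopping as soon as the last wanted line has been printed, instead of reading to the end. The following is only a sketch of that idea, not part of the original answer: the file names "a", "rand_numbers" and "sample.txt" and the awk array name "want" are chosen for illustration, and it assumes the numbers in rand_numbers are unique.

```shell
# Illustrative setup: "a" holds the numbers 1..100, one per line.
seq 100 > a

# Pick 5 unique line numbers; sorting is optional and only makes the list easier to read.
shuf -i 1-100 -n 5 | sort -n > rand_numbers

# First file: remember each wanted line number and count them.
# Second file: print matching lines and exit once all of them have been seen.
awk 'FNR==NR { want[$1]; n++; next }
     (FNR in want) { print; if (--n == 0) exit }' rand_numbers a > sample.txt

cat sample.txt
```

Because awk looks the line number up in an array, the list does not have to be sorted for this to work. The lines always come out in file order.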

Answer 2

You can use awk and shuf:

shuf file.txt > shuf.txt
awk '!a[$0]++' shuf.txt > uniqed.txt

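awk is well suited to removing the duplicates here: the expression !a[$0]++ is true only the first time a given line is seen, so each distinct line is printed once and the input does not need to be sorted.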
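To see what the deduplication step does in isolation, here is a small sketch on a hand-made input. The file name "dup.txt" and the array name "seen" are made up for the illustration.

```shell
# Illustrative input containing repeated lines.
printf 'b\na\nb\nc\na\n' > dup.txt

# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++ is true
# exactly once per distinct line; awk prints the line when the pattern is true.
awk '!seen[$0]++' dup.txt > uniqed.txt

cat uniqed.txt
# prints:
# b
# a
# c
```

The first occurrence of each line is kept and the original order is preserved.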