Select random lines from a file

Date: 2012-02-12 01:27:57

Tags: bash shell random text-processing

In a Bash script, I want to pick N random lines from an input file and output them to another file.

How can this be done?

6 answers:

Answer 0 (score: 441)

Use shuf with the -n option, as shown below, to get N random lines:

shuf -n N input > output

Answer 1 (score: 144)

Sort the file randomly and pick the first 100 lines:

$ sort -R input | head -n 100 >output

Answer 2 (score: 20)

Well, according to a comment on this short answer, he shuffled 78,000,000,000 lines in under a minute.

Challenge accepted...

EDIT: I broke my own record

powershuf did it in 0.047 seconds:

$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null 
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null  0.02s user 0.01s system 80% cpu 0.047 total

The reason it is so fast is that I don't read the whole file; I just move the file pointer 10 times and print the line after each position.
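
A minimal Python sketch of that idea (an illustration only, not the actual powershuf code; the file name is a placeholder): seek to a random byte offset, discard the partial line the offset lands in, and print the next full line.

#!/usr/bin/env python3
# Illustration of the pointer-seek idea only, not the powershuf implementation.
# "lines.txt" is a placeholder. Note that this samples lines with probability
# proportional to the length of the preceding line, so it is only roughly uniform.
import os
import random

path = "lines.txt"
size = os.path.getsize(path)

with open(path, "rb") as f:
    for _ in range(10):
        f.seek(random.randrange(size))  # jump to a random byte offset
        f.readline()                    # discard the partial line we landed in
        line = f.readline()             # the next full line is the sample
        if not line:                    # landed in the last line: wrap to the start
            f.seek(0)
            line = f.readline()
        print(line.decode().rstrip("\n"))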

Gitlab Repo

Old attempt

First, I needed a file of 78,000,000,000 lines:

seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt

This gives me a file with 78 billion newlines ;-)

Now for the shuf part:

$ time shuf -n 10 lines_78000000000.txt

shuf -n 10 lines_78000000000.txt  2171.20s user 22.17s system 99% cpu 36:35.80 total

The bottleneck was the CPU and the lack of multithreading: it pinned one core at 100% while the other 15 went unused.

Python is what I use regularly, so I'll use it to make this faster:

#!/bin/python3
import random

f = open("lines_78000000000.txt", "rt")

# Count the total number of lines by reading 64 KiB chunks and counting newlines
count = 0
while 1:
  buffer = f.read(65536)
  if not buffer: break
  count += buffer.count('\n')

# Note: readline()'s argument is a maximum byte count, not a line number
for i in range(10):
  f.readline(random.randint(1, count))

This got me just under a minute.

I did this on a Lenovo X1 Extreme 2nd Gen with an i9 and a Samsung NVMe.

I know it can get faster, but I'll leave some room for others to give it a try.

Line counter source: Luther Blissett

Answer 3 (score: 2)

My preferred option is very fast: I sampled a tab-delimited data file with 13 columns and 23.1 million rows, 2.0 GB uncompressed.

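For files of this size, one single-pass approach that does not hold the whole file in memory is reservoir sampling. The sketch below is an illustration only (the file names and N=100 are made up), not necessarily the option this answer refers to:

#!/usr/bin/env python3
# Reservoir sampling (Algorithm R): draw N lines uniformly at random in one pass.
# Illustrative sketch only; "table.tsv", "sample.tsv" and N = 100 are placeholders.
import random

N = 100
reservoir = []

with open("table.tsv", "rt") as f:
    for i, line in enumerate(f):
        if i < N:
            reservoir.append(line)      # fill the reservoir with the first N lines
        else:
            j = random.randint(0, i)    # replace an entry with probability N/(i+1)
            if j < N:
                reservoir[j] = line

with open("sample.tsv", "wt") as out:
    out.writelines(reservoir)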

Answer 4 (score: 0)

seq 1 100 | python3 -c 'print(__import__("random").choice(__import__("sys").stdin.readlines()))'
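
As written, this reads all of standard input into memory and prints a single random line. To get N lines instead of one, the same idea extends with random.sample; a sketch (N=10 here, and it still loads everything into memory):

seq 1 100 | python3 -c 'import sys, random; sys.stdout.writelines(random.sample(sys.stdin.readlines(), 10))'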

Answer 5 (score: 0)

# Function to sample N lines randomly from a file
# Parameter $1: Name of the original file
# Parameter $2: N lines to be sampled 
rand_line_sampler() {
    N_t=$(awk '{print $1}' $1 | wc -l) # Number of total lines

    N_t_m_d=$(( $N_t - $2 - 1 )) # Number of total lines, minus the desired number of lines, minus 1

    N_d_m_1=$(( $2 - 1)) # Number of desired lines minus 1

    # vector of 0s (fail); the initial echo plus the loop give N_t - $2 entries in total
    echo '0' > vector_0.temp
    for i in $(seq 1 1 $N_t_m_d); do
            echo "0" >> vector_0.temp
    done

    # vector to have the 1 (success) with size of desired number of lines
    echo '1' > vector_1.temp
    for i in $(seq 1 1 $N_d_m_1); do
            echo "1" >> vector_1.temp
    done

    cat vector_1.temp vector_0.temp | shuf > rand_vector.temp

    paste -d" " rand_vector.temp $1 |
    awk '$1 != 0 {$1=""; print}' |
    sed 's/^ *//' > sampled_file.txt # file with the sampled lines

    rm vector_0.temp vector_1.temp rand_vector.temp
}

rand_line_sampler "parameter_1" "parameter_2"