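How to select 10,000 random lines from a 44-million-line file

Time: 2016-01-22 16:21:55

Tags: linux bash perl sorting awk

I tried using shuf to shuffle the file, but it took too long and the process was killed by the hosting admins. I am on the cheapest Linux Bluehost plan.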

shuf MMM.csv

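The file has 44M lines and is 7439641823 bytes. Using sort -R is even worse. I considered splitting the file into 44 files, but that would not be very random. Any ideas would be much appreciated.

What I want is to shuffle the file and then extract the first 10,000 lines.

The file is already sorted, and for business reasons the 10,000 lines must not be sorted.

4 answers:

Answer 0: (score: 3)

The key is to use "shuf" with the -n ("output at most COUNT lines") option.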

Compare:

$ time (seq 1 44000000 | shuf > /tmp/shuffled)
user  0m58.234s
sys   0m4.394s

$ time (seq 1 44000000 | shuf -n 10000 > /tmp/shuffled)
user   0m25.493s
sys    0m1.771s

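(These timings were taken on a sad old 2.53GHz Mac.)

Note: in some environments, "shuf" may be "gshuf".

Answer 1: (score: 2)

Given that you asked to print a fixed number of random lines from a file: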

$ cat tst.awk
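NR==1 {
    srand()
    # Pick outNum distinct line numbers in 1..inNum before reading further.
    for (i=1;i<=outNum;i++) {
        if (tgts[int(rand()*inNum)+1]++) {
            i--    # duplicate line number, draw again
        }
    }
}
# Print only the lines whose number was picked.
NR in tgts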

$ seq 44000000 > file44m

$ time awk -v inNum=$(wc -l < file44m) -v outNum=10000 -f tst.awk file44m > file10k
real    0m17.676s
user    0m17.238s
sys     0m0.404s

$ sort -u file10k | wc -l
10000

The above stores only outNum line numbers in memory, so there are no memory issues. See below for how it works on a small file:

$ cat file
1
2
3
4
5
6
7
8
9
10

$ awk -v inNum=$(wc -l < file) -v outNum=4 -f tst.awk file
6
8
9
10

$ awk -v inNum=$(wc -l < file) -v outNum=4 -f tst.awk file
1
6
7
9

$ awk -v inNum=$(wc -l < file) -v outNum=3 -f tst.awk file
3
7
8

$ awk -v inNum=$(wc -l < file) -v outNum=3 -f tst.awk file
4
5
6
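
The same idea can be sketched in Python: pick the target line numbers first, then stream the input once and keep only those lines. The function name `sample_lines_by_number` and the in-memory demo data are illustrative assumptions, not part of the answer above. Like the awk script, it emits the chosen lines in their original file order, so a sorted input yields a sorted sample.

```python
import io
import random


def sample_lines_by_number(lines, total, k, rng=random):
    """Yield k distinct lines from an iterable of `total` lines, in input order."""
    if k > total:
        raise ValueError("cannot sample more lines than the input contains")
    # Choose k distinct 1-based line numbers up front; only these are held in memory.
    targets = set(rng.sample(range(1, total + 1), k))
    for number, line in enumerate(lines, start=1):
        if number in targets:
            yield line


if __name__ == "__main__":
    # Small stand-in for the real file: ten lines containing 1..10.
    demo = io.StringIO("".join(f"{i}\n" for i in range(1, 11)))
    for line in sample_lines_by_number(demo, total=10, k=4):
        print(line, end="")
```

The `k > total` check guards against the case where more lines are requested than exist, which would make the redraw loop in the awk version run forever.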

Answer 2: (score: 1)

I decided to use:

perl -ne 'print if (rand() < .001)' MMM.csv > MMM.out

and then take the subset of 10,000 from that.
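
As a rough illustration of what the one-liner does: each line is kept independently with probability p, so the output size is random, with an expected value of p times the line count. With p = 0.001 and 44 million lines that is about 44,000 lines, which leaves enough to cut down to 10,000. The Python sketch below mirrors that filter; `bernoulli_sample` is a hypothetical name and the demo runs on synthetic data.

```python
import random


def bernoulli_sample(lines, p, rng=random):
    """Yield each line independently with probability p (output size varies)."""
    for line in lines:
        if rng.random() < p:
            yield line


if __name__ == "__main__":
    total = 44_000_000
    p = 0.001
    # Expected number of kept lines is p * total.
    print(f"expected lines kept: {p * total:.0f}")

    # Small demonstration on synthetic data rather than the real file.
    kept = list(bernoulli_sample(range(100_000), p, random.Random(0)))
    print(f"kept {len(kept)} of 100000")
```

The kept lines stay in file order, so taking the first 10,000 of them would favor the start of the file. Shuffling the much smaller output first (for example with shuf -n 10000) avoids that.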

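But I would still like a solution that shuffles a 44M-line file in under 10 seconds. Is that possible on a shared hosting account?

Answer 3: (score: 0)

I wrote this about a year ago. Please let me know how your tests turn out. If your input is sorted but you do not want the result sorted, then shuffle the output.

It seems to me that your input is large enough relative to the output size that you will want to uncomment the progressive-skip section and get it working. I did not need it, so as far as I remember it is untested.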

#! /usr/bin/awk -f

# reservoir_sample.awk

# the basic reservoir algorithm is due to Alan Waterman (according to Knuth) 
# Vitter (85) improved the running time by skipping over records
# http://www.cs.umd.edu/~samir/498/vitter.pdf
# 
# Expect a K parameter which is the size of the intended sample
# 
# reservoir_sample.awk -v K=1000   population.list


BEGIN {
    # give srand a fixed seed for reproducibility
    # or a variable for diversity    
    srand(systime() + PROCINFO["pid"]); 

    # 23 is a magic number between 10 & 40 as per Vitter  
    threshold = 23 * K; 
}

# fill the reservoir
NR <= K {   reservoir[NR] = $0 }

# replace item in reservoir with current item
# on probability of K / (NR)
NR > K {
    # draw a slot uniformly from 1..NR (the original rounding could yield 0)
    uniform_probability = int(rand() * NR) + 1;
    if (uniform_probability  <= K) 
        reservoir[uniform_probability] = $0

    # when the population is large with respect to sample size K
    # test fewer items from the population for inclusion in the reservoir   
    #if(NR < threshold){
    #   skip = 
    #   for (i=0 ; i < skip; i++) getline()  
    #}
}

# and Bobs yer uncle
END {   for(item in reservoir) {print reservoir[item]} }

# the progressively larger steps through the population are only a
# speed optimization: replacing a random slot with probability K / NR
# already gives every line the same chance of ending up in the sample.
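
For comparison, here is a minimal Python sketch of the basic reservoir scheme without the skip optimization. It assumes the whole sample of k lines fits in memory; `reservoir_sample` is a hypothetical name, not something from the script above. It also shuffles the reservoir at the end, since the first k slots otherwise tend to preserve input order.

```python
import random


def reservoir_sample(items, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for seen, item in enumerate(items, start=1):
        if seen <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in 1..seen; the item is kept with probability k / seen.
            slot = rng.randint(1, seen)
            if slot <= k:
                reservoir[slot - 1] = item
    # The reservoir is not in random order, so shuffle before returning.
    rng.shuffle(reservoir)
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1, 44_001), 10))
```

Unlike the line-number approach, this needs only one pass and no prior line count, at the cost of one random draw per input line.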