I tried using shuf to shuffle the file, but it took too long; the process was killed by the hosting administrator. I have the cheapest Bluehost Linux plan.
shuf MMM.csv
The file has 44M lines and is 7439641823 bytes. Using sort -R was even worse. I considered splitting the file into 44 files, but that would not be very random. Any ideas would be much appreciated.
What I want is to shuffle the file and then extract the first 10000 lines.
The file is already sorted, and for business reasons those 10000 lines must not stay in sorted order.
Answer 0 (score: 3)
The trick is to use "shuf" with its -n option ("output at most COUNT lines"). Compare:
$ time (seq 1 44000000 | shuf > /tmp/shuffled)
user 0m58.234s
sys 0m4.394s
$ time (seq 1 44000000 | shuf -n 10000 > /tmp/shuffled)
user 0m25.493s
sys 0m1.771s
(These times were taken on a sad old 2.53GHz Mac.)
Note: in some environments, "shuf" may be called "gshuf".
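Applied to the file from the question, that would look something like this (a sketch; MMM.out is just an illustrative output name):
$ shuf -n 10000 MMM.csv > MMM.out
Depending on the coreutils version, shuf -n may use reservoir sampling internally, which avoids holding the whole 7 GB file in memory.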
Answer 1 (score: 2)
Given that you are asking to print some fixed number of random lines from the file:
$ cat tst.awk
NR==1 {
srand()
for (i=1;i<=outNum;i++) {
if (tgts[int(rand()*inNum)+1]++) {
i--
}
}
}
NR in tgts
$ seq 44000000 > file44m
$ time awk -v inNum=$(wc -l < file44m) -v outNum=10000 -f tst.awk file44m > file10k
real 0m17.676s
user 0m17.238s
sys 0m0.404s
$ sort -u file10k | wc -l
10000
The above only stores outNum line numbers in memory, so there are no memory issues. See below for how it works on a small file:
$ cat file
1
2
3
4
5
6
7
8
9
10
$ awk -v inNum=$(wc -l < file) -v outNum=4 -f tst.awk file
6
8
9
10
$ awk -v inNum=$(wc -l < file) -v outNum=4 -f tst.awk file
1
6
7
9
$ awk -v inNum=$(wc -l < file) -v outNum=3 -f tst.awk file
3
7
8
$ awk -v inNum=$(wc -l < file) -v outNum=3 -f tst.awk file
4
5
6
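Note that tst.awk prints the selected lines in their original file order. Since the question says the 10000 lines must not stay in sorted order, one possible extra step (a sketch; shuffling only 10000 lines is cheap) is to pipe the result through shuf:
$ awk -v inNum=$(wc -l < MMM.csv) -v outNum=10000 -f tst.awk MMM.csv | shuf > MMM.out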
Answer 2 (score: 1)
I decided to use:
perl -ne 'print if (rand() < .001)' MMM.csv > MMM.out
and take a subset of 10000 from that. But I would still like a solution that can shuffle a 44M-line file in under 10 seconds. Is that feasible on a shared hosting account?
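For reference, rand() < .001 keeps roughly 44000 of the 44M lines, so cutting that down to exactly 10000 is cheap. One possible follow-up (a sketch; MMM.10k is just an illustrative name for the final sample):
$ shuf -n 10000 MMM.out > MMM.10k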
Answer 3 (score: 0)
I wrote this about a year ago. Please let me know how it works out in your tests. If your input is sorted but you don't want the result sorted, then shuffle the output.
It seems to me that your input is large enough relative to the output size that you will want to uncomment the progressive-skip section and get it working. I didn't need it, and as I recall it is untested.
#!/usr/bin/awk -f
# reservoir_sample.awk
# the basic reservoir algorithm is due to Alan Waterman (according to Knuth)
# Vitter (85) improved timing and made sampling uniform
# http://www.cs.umd.edu/~samir/498/vitter.pdf
#
# Expect a K parameter which is the size of the intended sample
#
# reservoir_sample.awk -v K=1000 population.list
BEGIN {
# give srand a fixed seed for reproducibility
# or a variable for diversity
srand(systime() + PROCINFO["pid"]);
# 23 is a magic number between 10 & 40 as per Vitter
threshold = 23 * K;
}
# fill the reservoir
NR <= K { reservoir[NR] = $0 }
# replace an item in the reservoir with the current item
# with probability K / NR
NR > K {
uniform_probability = int(rand() * NR) + 1;  # uniform in 1..NR
if (uniform_probability <= K)
reservoir[uniform_probability] = $0
# when the population is large with respect to sample size K
# test fewer items from the population for inclusion in the reservoir
#if(NR < threshold){
# skip =
# for (i=0 ; i < skip; i++) getline()
#}
}
# and Bob's yer uncle
END { for(item in reservoir) {print reservoir[item]} }
# without the progressively larger steps through the population
# this would not approach uniform, since for larger populations
# earlier candidates are replaced more often than later candidates.
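A possible invocation against the file from the question (an illustrative sketch, assuming gawk, which provides PROCINFO; MMM.out is just a placeholder name):
$ awk -v K=10000 -f reservoir_sample.awk MMM.csv > MMM.out
Unlike tst.awk above, this stores the K sampled lines themselves rather than line numbers, so it needs only a single pass and no separate wc -l over the file.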