I have a lot of text files that look like this:
>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
>HLGKAHOLAGGATACCATAGATGGCACGCCCT
>DLGKAHOLAGGATACCATAGATGGCACGCCCT
>ELGKAHOLAGGATACCATAGATGGCACGCCCT
>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>JGGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT
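Is there a way to sample without replacement using awk?
For example, I have these 8 lines and want to write 4 of them, chosen at random without replacement, to a new file. The output should look something like this: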
>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT
>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
Thanks in advance.
Answer 0 (score: 14)
How about this for a random sample of roughly 10% of your lines?
awk 'rand()>0.9' yourfile1 yourfile2 anotherfile
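I am not sure what you mean by "replacement"... nothing is replaced here, lines are just selected at random.
Basically, it looks at each line of each file exactly once and generates a random number in the interval from 0 to 1. If the random number is greater than 0.9, the line is output. So it is rolling a 10-sided die for each line and printing the line only when the die comes up 10. A line cannot be printed twice, unless it occurs twice in your files, of course. Note that this gives about 10% of the lines, not an exact count.
For added randomness (!), you can add srand() at the start, as suggested by @klashxx. Without it, many awk implementations (gawk among them) produce the same sequence on every run: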
awk 'BEGIN{srand()} rand()>0.9' yourfile(s)
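The one-liner above prints about 10% of the lines, so the count changes from run to run. If exactly 4 of the 8 lines are wanted, as in the question, one option is reservoir sampling, which keeps a pool of k lines and overwrites pool entries with decreasing probability as more lines are read. This is only a sketch and not part of the answer above; sample_k and its arguments are names made up here.

```shell
#!/usr/bin/env bash
# sample_k K FILE...: print exactly K distinct input lines, chosen uniformly
# at random in a single pass (reservoir sampling).
# sample_k and the awk variable k are illustrative names.
sample_k() {
    local k="$1"
    shift
    awk -v k="$k" '
        BEGIN { srand() }                     # seed from the time of day
        NR <= k { pool[NR] = $0; next }       # fill the reservoir first
        {
            j = int(rand() * NR) + 1          # uniform integer in 1..NR
            if (j <= k) pool[j] = $0          # keep this line with probability k/NR
        }
        END {
            n = (NR < k) ? NR : k             # short input: print every line
            for (i = 1; i <= n; i++) print pool[i]
        }
    ' "$@"
}
```

Usage would be something like sample_k 4 yourfile > sample.txt. Because srand() typically seeds from the clock in whole seconds, two runs within the same second may return the same sample.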
Answer 1 (score: 3)
Yes, but I wouldn't. I would use shuf or sort -R (neither is POSIX) to randomize the file and then select the first n lines with head.
If you really want to use awk for this, you need the rand function, as Mark Setchell points out.
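As a rough sketch of the shuffle-then-take-the-head idea: GNU shuf can do both steps at once with -n. Where shuf is missing, one fallback (an assumption of mine, not something from this answer) is to tag each line with a random key in awk, sort on the key, and strip it again. sample_lines is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# sample_lines N FILE: print N lines of FILE picked at random, no repeats.
# sample_lines is an illustrative name.
sample_lines() {
    local n="$1" file="$2"
    if command -v shuf >/dev/null 2>&1; then
        shuf -n "$n" -- "$file"               # GNU coreutils: shuffle, keep N lines
    else
        # Fallback: prefix each line with a random key and a tab, sort on the
        # key, drop the key, and keep the first N lines.
        awk 'BEGIN { srand() } { printf "%.17f\t%s\n", rand(), $0 }' "$file" |
            sort -k1,1 | cut -f2- | head -n "$n"
    fi
}
```

For the question's case this would be sample_lines 4 yourfile > sample.txt.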
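Answer 2 (score: 1)
Sampling a text file at random without replacement means that once a line has been randomly selected (sampled), it cannot be selected again. So if 10 lines out of 100 are to be selected, the ten random line numbers must be unique.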
Here is a script that produces NUM random samples (without replacement) from the text file FILE:
#!/usr/bin/env bash
# random-samples.sh NUM FILE
# extract NUM random (without replacement) lines from FILE

num=$(( 10#${1:?'Missing sample size'} ))
file="${2:?'Missing file to sample'}"
lines=$(wc -l < "$file")    # number of lines in the file

# cannot draw more unique lines than the file has
if (( num > lines )); then
    echo "Sample size $num exceeds the $lines lines in $file" >&2
    exit 1
fi

# get_sample MAX
#
# get a random number between 1 .. MAX
# (see the bash man page on RANDOM; it ranges over 0 .. 32767, so at most
# 32768 distinct line numbers can be produced)
get_sample() {
    local max="$1"
    # divide by 32768, not 32767, so the result never exceeds MAX
    local rand=$(( ((max * RANDOM) / 32768) + 1 ))
    echo "$rand"
}

# select_line LINE FILE
#
# select line LINE from FILE
select_line() {
    head -n "$1" "$2" | tail -n 1
}

declare -A samples    # keep track of samples (associative array, bash 4+)

for ((i=1; i<=num; i++)) ; do
    sample=
    while [[ -z "$sample" ]]; do
        sample=$(get_sample "$lines")           # get a new sample
        if [[ -n "${samples[$sample]}" ]]; then # already used?
            sample=                             # yes, go again
        else
            samples[$sample]=1                  # new sample, track it
        fi
    done
    line=$(select_line "$sample" "$file")       # fetch the sampled line
    printf "%2d: %s\n" "$i" "$line"
done
exit
Here is the output of a few invocations:
./random-samples.sh 10 poetry-samples.txt
1: 11. Because I could not stop for death/He kindly stopped for me 2,360,000 Emily Dickinson
2: 25. Hope springs eternal in the human breast 1,080,000 Alexander Pope
3: 43. The moving finger writes; and, having writ,/Moves on 571,000 Edward Fitzgerald
4: 5. And miles to go before I sleep 5,350,000 Robert Frost
5: 6. Not with a bang but a whimper 5,280,000 T.S. Eliot
6: 40. In Xanadu did Kubla Khan 594,000 Coleridge
7: 41. The quality of mercy is not strained 589,000 Shakespeare
8: 7. Tread softly because you tread on my dreams 4,860,000 W.B. Yeats
9: 42. They also serve who only stand and wait 584,000 Milton
10: 48. If you can keep your head when all about you 447,000 Kipling
./random-samples.sh 10 poetry-samples.txt
1: 38. Shall I compare thee to a summers day 638,000 Shakespeare
2: 34. Busy old fool, unruly sun 675,000 John Donne
3: 14. Candy/Is dandy/But liquor/Is quicker 2,150,000 Ogden Nash
4: 45. We few, we happy few, we band of brothers 521,000 Shakespeare
5: 9. Look on my works, ye mighty, and despair 3,080,000 Shelley
6: 11. Because I could not stop for death/He kindly stopped for me 2,360,000 Emily Dickinson
7: 46. If music be the food of love, play on 507,000 Shakespeare
8: 44. What is this life if, full of care,/We have no time to stand and stare 528,000 W.H. Davies
9: 35. Do not go gentle into that good night 665,000 Dylan Thomas
10: 15. But at my back I always hear 2,010,000 Marvell
./random-samples.sh 10 poetry-samples.txt
1: 26. I think that I shall never see/A poem lovely as a tree. 1,080,000 Joyce Kilmer
2: 32. Human kind/Cannot bear very much reality 891,000 T.S. Eliot
3: 14. Candy/Is dandy/But liquor/Is quicker 2,150,000 Ogden Nash
4: 13. My mistress’ eyes are nothing like the sun 2,230,000 Shakespeare
5: 42. They also serve who only stand and wait 584,000 Milton
6: 24. When in disgrace with fortune and men's eyes 1,100,000 Shakespeare
7: 21. A narrow fellow in the grass 1,310,000 Emily Dickinson
8: 9. Look on my works, ye mighty, and despair 3,080,000 Shelley
9: 10. Tis better to have loved and lost/Than never to have loved at all 2,400,000 Tennyson
10: 31. O Romeo, Romeo; wherefore art thou Romeo 912,000 Shakespeare
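The script above re-reads the file with head and tail once per sample. The same idea of drawing unique line numbers and rejecting repeats can also be sketched inside awk, so the file is read only once after counting its lines. This is an illustrative variant, not the answer's script: pick_lines is a made-up name, and the lines come out in file order, not in the order they were drawn.

```shell
#!/usr/bin/env bash
# pick_lines NUM FILE: print NUM distinct, randomly chosen lines of FILE.
# Line numbers are drawn at random and repeats are rejected; the chosen
# lines are then printed in one pass, in file order.
pick_lines() {
    local num="$1" file="$2" total
    total=$(wc -l < "$file")                  # number of lines in FILE
    awk -v num="$num" -v total="$total" '
        BEGIN {
            srand()
            num += 0; total += 0              # force numeric values
            if (num > total) num = total      # cannot draw more lines than exist
            while (picked < num) {
                n = int(rand() * total) + 1   # candidate line number in 1..total
                if (!(n in want)) {           # reject numbers already drawn
                    want[n] = 1
                    picked++
                }
            }
        }
        (NR in want)                          # print only the chosen lines
    ' "$file"
}
```

Called as pick_lines 10 poetry-samples.txt, this should return ten distinct lines, without the numbering the script adds.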
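Answer 3 (score: 0)
It might be better to sample the file using a fixed pattern, such as taking one record every 10 lines. You can do that with this awk one-liner: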
awk '0==NR%10' filename
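If you want to sample a percentage of the total, you can work out the step the awk one-liner should use so that the number of records printed matches that count or percentage.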
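One way to work out that step, as a minimal sketch: round 100 divided by the percentage and print every step-th line. every_pct is a hypothetical helper, and it assumes a percentage greater than 0. Keep in mind that this picks the same lines on every run, so it is a fixed-pattern sample and not a random one.

```shell
#!/usr/bin/env bash
# every_pct PCT FILE: print roughly PCT percent of FILE by taking every
# step-th line, where step = round(100 / PCT). Deterministic, not random.
# every_pct is an illustrative name.
every_pct() {
    local pct="$1" file="$2"
    awk -v pct="$pct" '
        BEGIN {
            if (pct <= 0) {                   # avoid dividing by zero
                print "pct must be greater than 0" > "/dev/stderr"
                exit 1
            }
            step = int(100 / pct + 0.5)       # e.g. pct=10 gives step=10
            if (step < 1) step = 1            # pct above 100: print every line
        }
        NR % step == 0
    ' "$file"
}
```

For example, every_pct 10 filename behaves like the one-liner above, and every_pct 25 filename prints every 4th line.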
I hope this helps!