Sampling without replacement using awk

Date: 2014-03-10 15:00:29

Tags: bash shell awk

I have a lot of text files that look like this:

>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
>HLGKAHOLAGGATACCATAGATGGCACGCCCT
>DLGKAHOLAGGATACCATAGATGGCACGCCCT
>ELGKAHOLAGGATACCATAGATGGCACGCCCT
>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>JGGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT

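Is there a way to sample without replacement using awk?

For example, I have these 8 lines, and I want to randomly sample just 4 of them into a new file, without replacement. The output should look like this: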

>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT
>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT

Thanks in advance

4 answers:

Answer 0 (score: 14)

How about randomly sampling 10% of the lines?

awk 'rand()>0.9' yourfile1 yourfile2 anotherfile

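I'm not sure what you mean by "replacement"... there is no replacement here, just random selection.

Basically, it looks at each line of each file exactly once and generates a random number in the interval from 0 to 1. If the random number is greater than 0.9, the line is output. So it is rolling a 10-sided die for each line and printing the line only when the die comes up 10. There is no chance of a line being printed twice, unless it occurs twice in your file, of course.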

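For added randomness (!), you can add srand() at the start, as suggested by @klashxx. Without it, some awk implementations (gawk, for example) return the same sequence from rand() on every run:
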
awk 'BEGIN{srand()} rand()>0.9' yourfile(s)
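
Note that the rand() filter prints roughly 10% of the lines, so the number of lines it returns varies from run to run. If exactly k lines are wanted, as in the question's 4 out of 8, one option is selection sampling: count the lines first, then accept each line with probability (lines still wanted) / (lines still to come). The following is a minimal sketch under those assumptions; the function name sample_exact is made up for this example, and FILE must be a regular file because it is read twice.

```shell
# sample_exact K FILE: print exactly K lines of FILE (or all of them if it
# has fewer), each line at most once, in their original order.
# The function name is hypothetical, chosen for this example only.
sample_exact() {
  total=$(wc -l < "$2")                      # first pass: count the lines
  awk -v k="$1" -v n="$total" '
    BEGIN { srand() }
    # k lines are still wanted and (n - NR + 1) lines are still to come,
    # this one included: take it with probability k / (n - NR + 1)
    rand() * (n - NR + 1) < k { print; k-- }
  ' "$2"
}
```

Something like `sample_exact 4 yourfile > sample.txt` would then write four distinct lines to a new file. In most awk implementations srand() seeds from the clock with one-second resolution, so two runs within the same second may return the same sample.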

Answer 1 (score: 3)

Yes, but I wouldn't. I would use shuf or sort -R (neither is POSIX) to randomize the file, and then use head to select the first n lines.
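
The two commands mentioned above could be wrapped as follows. This is only a sketch: the function names are made up, shuf is assumed to be the GNU coreutils version, and sort -R is assumed to be available (GNU sort and some BSD sorts accept it).

```shell
# Both helpers read FILE (or standard input) and print N lines, with no
# input line picked twice. The function names are hypothetical.
sample_shuf() {                 # shuf can stop after N lines by itself
  n=$1; shift
  shuf -n "$n" "$@"
}

sample_sort() {                 # shuffle everything, then keep the first N
  n=$1; shift
  sort -R "$@" | head -n "$n"
}
```

For the question's case this would be `sample_shuf 4 yourfile > sample.txt`. One caveat with GNU sort -R is that it orders lines by a random hash of their content, so identical lines end up next to each other; shuf does not have that property.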

If you really want to use awk, you would need to use the rand function, as Mark Setchell points out.
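
As an illustration of what a rand-based awk solution might look like when a fixed number of lines is required, here is a sketch of reservoir sampling. It is not taken from any of the answers; the function name sample_k and the variable names are made up. It reads the input once, so it also works on a pipe, and it keeps only k lines in memory.

```shell
# sample_k K [FILE]: print K distinct lines chosen uniformly at random,
# reading the input a single time. The function name is hypothetical.
sample_k() {
  k=$1; shift
  awk -v k="$k" '
    BEGIN { srand() }
    NR <= k { pool[NR] = $0; next }   # the first k lines fill the reservoir
    {
      j = int(rand() * NR) + 1        # j is uniform on 1..NR
      if (j <= k) pool[j] = $0        # so this line is kept with probability k/NR
    }
    END {
      n = (NR < k) ? NR : k           # short input: print what there is
      for (i = 1; i <= n; i++) print pool[i]
    }
  ' "$@"
}
```

A call such as `sample_k 4 yourfile` returns four different lines. Every set of four lines is equally likely, but the order in which they are printed is not a uniform shuffle; pipe the result through shuf if the order matters.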

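Answer 2 (score: 1)

Taking a random sample from a text file without replacement means that once a line has been randomly selected (sampled), it cannot be selected again. So if 10 lines out of 100 are to be selected, the ten random line numbers must be unique.

Here is a script that produces NUM random samples (without replacement) from the text file FILE: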

#!/usr/bin/env bash
# random-samples.sh NUM FILE
# extract NUM random (without replacement) lines from FILE

num=$(( 10#${1:?'Missing sample size'} ))
file="${2:?'Missing file to sample'}"

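lines=$(wc -l < "$file")   # number of lines in the file

# asking for more unique lines than the file holds would loop forever
if (( num > lines )); then
  echo "Cannot take $num unique samples from $lines lines" >&2
  exit 1
fi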

# get_sample MAX
#
# get a random number between 1 .. max
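# (see the bash man page on RANDOM, which yields 0 .. 32767)
# RANDOM has only 32768 distinct values, so very long files are sampled unevenly

get_sample() {
  local max="$1"
  local rand=$(( ((max * RANDOM) / 32768) + 1 ))   # dividing by 32768 keeps the result <= max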
  echo "$rand"
}

# select_line LINE FILE
#
# select line LINE from FILE

select_line() {
  head -n "$1" "$2" | tail -n 1
}

declare -A samples     # keep track of samples (associative arrays need bash 4 or later)

for ((i=1; i<=num; i++)) ; do
  sample=
  while [[ -z "$sample" ]]; do
    sample=$(get_sample "$lines")            # get a new sample
    if [[ -n "${samples[$sample]}" ]]; then  # already used?
      sample=                                # yes, go again
    else
      (( samples[$sample]=1 ))               # new sample, track it
    fi
  done
  line=$(select_line "$sample" "$file")      # fetch the sampled line
  printf "%2d: %s\n" "$i" "$line"
done
exit

Here is the output of some invocations:

./random-samples.sh 10 poetry-samples.txt
 1: 11. Because I could not stop for death/He kindly stopped for me 2,360,000 Emily Dickinson
 2: 25. Hope springs eternal in the human breast 1,080,000 Alexander Pope
 3: 43. The moving finger writes; and, having writ,/Moves on571,000 Edward Fitzgerald
 4: 5. And miles to go before I sleep 5,350,000 Robert Frost
 5: 6. Not with a bang but a whimper 5,280,000 T.S. Eliot
 6: 40. In Xanadu did Kubla Khan 594,000 Coleridge
 7: 41. The quality of mercy is not strained 589,000 Shakespeare
 8: 7. Tread softly because you tread on my dreams 4,860,000 W.B. Yeats
 9: 42. They also serve who only stand and wait 584,000 Milton
10: 48. If you can keep your head when all about you 447,000Kipling

./random-samples.sh 10 poetry-samples.txt
 1: 38. Shall I compare thee to a summers day 638,000 Shakespeare
 2: 34. Busy old fool, unruly sun 675,000 John Donne
 3: 14. Candy/Is dandy/But liquor/Is quicker 2,150,000 Ogden Nash
 4: 45. We few, we happy few, we band of brothers 521,000Shakespeare
 5: 9. Look on my works, ye mighty, and despair 3,080,000 Shelley
 6: 11. Because I could not stop for death/He kindly stopped for me 2,360,000 Emily Dickinson
 7: 46. If music be the food of love, play on 507,000 Shakespeare
 8: 44. What is this life if, full of care,/We have no time to stand and stare 528,000 W.H. Davies
 9: 35. Do not go gentle into that good night 665,000 Dylan Thomas
10: 15. But at my back I always hear 2,010,000 Marvell

./random-samples.sh 10 poetry-samples.txt
 1: 26. I think that I shall never see/A poem lovely as a tree. 1,080,000 Joyce Kilmer
 2: 32. Human kind/Cannot bear very much reality 891,000 T.S. Eliot
 3: 14. Candy/Is dandy/But liquor/Is quicker 2,150,000 Ogden Nash
 4: 13. My mistress’ eyes are nothing like the sun 2,230,000Shakespeare
 5: 42. They also serve who only stand and wait 584,000 Milton
 6: 24. When in disgrace with fortune and men's eyes 1,100,000Shakespeare
 7: 21. A narrow fellow in the grass 1,310,000 Emily Dickinson
 8: 9. Look on my works, ye mighty, and despair 3,080,000 Shelley
 9: 10. Tis better to have loved and lost/Than never to have loved at all 2,400,000 Tennyson
10: 31. O Romeo, Romeo; wherefore art thou Romeo 912,000Shakespeare
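
The script above runs head and tail once per sampled line, so the file is re-read for every sample. The same idea (draw line numbers, reject any that were already used) can also be written in awk so that the file is scanned only once after counting it. This is a sketch, not part of the original answer; the function name sample_lines and the array name chosen are made up.

```shell
# sample_lines K FILE: draw K distinct line numbers up front, then print the
# matching lines in a single pass. The function name is hypothetical.
sample_lines() {
  total=$(wc -l < "$2")
  awk -v k="$1" -v n="$total" '
    BEGIN {
      srand()
      k += 0; n += 0                    # force numeric comparison
      if (k > n) k = n                  # cannot draw more lines than exist
      while (count < k) {
        j = int(rand() * n) + 1         # candidate line number, 1..n
        if (!(j in chosen)) {           # already used? then draw again
          chosen[j] = 1
          count++
        }
      }
    }
    NR in chosen                        # print only the chosen lines
  ' "$2"
}
```

Unlike the bash script, this prints the sampled lines in file order rather than in the order they were drawn.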

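Answer 3 (score: 0)

It may be better to sample the file using a fixed pattern, such as taking one record out of every 10 lines. You can do that with this awk one-liner:
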
awk '0==NR%10' filename

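If you want to sample a percentage of the total, you can work out the line interval that the awk one-liner should use so that the number of records printed matches that quantity or percentage.

I hope this helps!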
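
One way to do that calculation is sketched below: the interval is taken as 100 divided by the percentage, rounded to the nearest whole number. The function name sample_pct is made up, and the percentage is assumed to be greater than 0.

```shell
# sample_pct PCT [FILE]: print roughly PCT percent of the input by keeping
# every Nth line, where N = round(100 / PCT). The function name is hypothetical.
sample_pct() {
  pct=$1; shift
  awk -v pct="$pct" '
    BEGIN {
      step = int(100 / pct + 0.5)       # e.g. 10 percent gives step 10
      if (step < 1) step = 1            # never skip below every line
    }
    NR % step == 0
  ' "$@"
}
```

Because the interval must be a whole number, percentages that do not divide 100 evenly are only approximated (30 percent becomes every 3rd line, for instance), and the result is a fixed pattern rather than a random sample.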