Question

I have a script that is called like this:

./script number file

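Here number is the number of lines I want to take from the file named file. The lines are chosen at random and then printed twice. For a very large file (~1,000,000 lines) this algorithm runs far too slowly. I don't understand why, since printing is just array access.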

#!/bin/bash

max=`wc -l $2 | cut -d " " -f1`

users=(`shuf -i 0-$max -n $1`)
pages=(`shuf -i 0-$max -n $1`)

readarray lines < $2

for (( i = 0; i < $1; i++ )); do
    echo L ${lines[${users[i]}]} ${lines[${pages[i]}]} 
done

for (( i = 0; i < $1; i++ )); do
    echo U ${lines[${users[i]}]} ${lines[${pages[i]}]} 
done

Answer 1

Just use shuf to select the lines; that is what it is designed for. For example (see the notes):

readarray users < <(shuf -n $1 "$2")
readarray pages < <(shuf -n $1 "$2")
for (( i = 0; i < $1; i++ )); do
    echo L ${users[i]} ${pages[i]} 
done
for (( i = 0; i < $1; i++ )); do
    echo U ${users[i]} ${pages[i]} 
done

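This will still be slow, because shuf needs to read the whole file to find the line endings and you are calling it twice, but it is probably faster than reading the whole file into memory as a bash array, particularly if you don't have enough memory. (It also won't work if the script's second argument is not a regular file; if it is a pipe, it cannot be read twice.)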

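You could speed it up by selecting both sets of lines in one pass and then dividing them between users and pages, but you would need to do some work to get an unbiased distribution, assuming you care about that.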
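One way to read the single-pass idea, as a minimal sketch rather than a definitive implementation: draw 2n lines with one shuf call and pair the first n with the last n. The helper name pick_split is hypothetical, and the sketch assumes GNU shuf and a file with at least 2n lines. Because the 2n lines are drawn without replacement, a line can never land in both halves, which is exactly the kind of bias mentioned above.

```shell
#!/bin/bash
# Sketch: one shuf pass instead of two. pick_split is a hypothetical helper.
# Assumes the file has at least 2*n lines.
pick_split() {
    local n=$1 file=$2
    # shuf emits the 2*n chosen lines in random order; awk buffers the
    # first n ("users") and pairs each with one of the last n ("pages").
    shuf -n "$((2 * n))" -- "$file" |
        awk -v n="$n" 'NR <= n { first[NR] = $0; next }
                       { print first[NR - n], $0 }'
}

# Only run the main part when called as: ./script number file
if (( $# == 2 )); then
    pairs=$(pick_split "$1" "$2")
    sed 's/^/L /' <<<"$pairs"
    sed 's/^/U /' <<<"$pairs"
fi
```

The file is read once, and only n lines are held in memory by awk at any time.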

Note 1:

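As @gniourf_gniourf points out in a comment, you can reproduce the lines more accurately by passing the -t option to readarray and then quoting the arguments to echo. Also, mapfile is the preferred name for readarray: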

mapfile -t users < <(shuf -n $1 "$2")
mapfile -t pages < <(shuf -n $1 "$2")
for (( i = 0; i < $1; i++ )); do
    echo L "${users[i]}" "${pages[i]}" 
done
for (( i = 0; i < $1; i++ )); do
    echo U "${users[i]}" "${pages[i]}"
done

Note 2:

If $1 is large, you are better off not using arrays at all. Here is one possible solution:

lines="$(paste -d' ' <(shuf -n $1 "$2") <(shuf -n $1 "$2"))"
sed 's/^/L /' <<<"$lines"
sed 's/^/U /' <<<"$lines"

Answer 2

Perhaps you can do without arrays entirely and just use file utilities and temporary files:

# Put the shuf outputs in two separate files:

shuf -n "$1" "$2" > shuf_users
shuf -n "$1" "$2" > shuf_pages

# paste the two:
paste -d ' ' shuf_users shuf_pages | sed 's/^/L /'
paste -d ' ' shuf_pages shuf_users | sed 's/^/U /'

In @rici's solution, the culprit may also be the two loops that output the lines (this kind of for loop is very slow in bash).

You should use mktemp to create the temporary files shuf_users and shuf_pages. This is left as an exercise for the reader.
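For readers who would rather skip the exercise, here is a minimal sketch of the same pipeline with mktemp. The function name print_pairs is hypothetical, and the sketch assumes GNU shuf; cleanup is done with a plain rm at the end, so an interrupted run may leave the two files behind unless you add a trap.

```shell
#!/bin/bash
# Sketch of the temporary-file approach with mktemp-generated names.
# print_pairs is a hypothetical helper name.
print_pairs() {
    local n=$1 file=$2 users pages
    users=$(mktemp) || return 1
    pages=$(mktemp) || { rm -f "$users"; return 1; }

    # Two independent samples of n lines each.
    shuf -n "$n" -- "$file" > "$users"
    shuf -n "$n" -- "$file" > "$pages"

    # Same output layout as the fixed-name version above.
    paste -d ' ' "$users" "$pages" | sed 's/^/L /'
    paste -d ' ' "$pages" "$users" | sed 's/^/U /'

    rm -f "$users" "$pages"
}

# Only run the main part when called as: ./script number file
if (( $# == 2 )); then
    print_pairs "$1" "$2"
fi
```

Unique names from mktemp mean two concurrent runs in the same directory no longer overwrite each other's shuf_users and shuf_pages.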

Answer 3

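The following should do what you want fairly quickly. Bash arrays are slow and are built using temporary files, so your performance should be no better with them. They would be a nice feature if the Bash maintainers implemented them properly, but they are not there yet: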

File (make sure to name it the same, this is recursive): ranlines.bsh

#!/bin/bash
declare -i max=$(wc -l $2 | cut -d " " -f1)+1
declare STR=""
declare -i random_line=0
declare tmp_file="/tmp/_$$_$(date)"
declare -r usr_file="/tmp/_user_3434"
declare -r pgs_file="/tmp/_pgs_4343"

## create our tmp_file and seed it with 0 so that line number 0 is never chosen
echo "0" >> "$tmp_file" 

for (( i = 0; i < $1; i++ )); do
 while :; do 
   random_line=$(($RANDOM*30%$max));
   ## if you find an entry already in the tmp_file then continue 
   ## get a new number, loop until you find a new number
   ## -x matches whole lines only, so 12 is not mistaken for 123
   grep -qx -- "$random_line" "$tmp_file" && continue;
   echo "$random_line" >> "$tmp_file" 
   break; 
 done 
 ## build the sed print string
 STR="$STR${random_line}p;"
done
rm "$tmp_file" 

if [[ $# -eq 2 ]]; then 
 #usr_file
 sed -n "$STR" "$2" > "$usr_file" 
 ## call us again, this time for the U 
 ranlines.bsh $1 $2 "U"
else 
 ## we know already we are processing the U because args is not 2 
 declare -i random_slct=$1+1
 sed -n "$STR" "$2" > "$pgs_file" 
 paste <(sed -n "${random_slct}q; a L" "$2") "$usr_file" "$pgs_file"
 paste <(sed -n "${random_slct}q; a U" "$2") "$pgs_file" "$usr_file"
 rm "$pgs_file" "$usr_file"
fi   
exit 0
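The number-picking loop in the script above can be sketched more compactly with shuf -i, which the question already uses. This is a swapped-in technique, not the answer author's method: shuf draws n distinct line numbers, and awk prints the matching lines with a hash lookup instead of building a long sed script. The name pick_lines is hypothetical, and the sketch assumes GNU shuf, n of at least 1, and a non-empty file.

```shell
#!/bin/bash
# Sketch: pick n distinct lines by number. pick_lines is a hypothetical helper.
pick_lines() {
    local n=$1 file=$2 max
    max=$(wc -l < "$file")
    # First input: the chosen line numbers, stored as keys of "want".
    # Second input: the data file; print a line when its number is a key.
    awk 'NR == FNR { want[$1]; next }  FNR in want' \
        <(shuf -i "1-$max" -n "$n") "$file"
}

# Only run the main part when called as: ./script number file
if (( $# == 2 )); then
    pick_lines "$1" "$2"
fi
```

As with the sed-based script, the selected lines come out in file order rather than in random order.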

Selecting random lines from a file in Bash takes too long
