Question

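The following bash code reads a single input file line by line and writes to a large number (~100) of output files. Its performance is unreasonable: on the order of 30 seconds for 10,000 lines, whereas I want it to be usable for millions or billions of lines of input.

In the following code, batches is an already-defined associative array (in other languages, a map).

How can this be improved?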

while IFS='' read -r line
do
    x=`echo "$line" | cut -d$'\t' -f1`;
    y=`echo "$line" | cut -d$'\t' -f2`;
#   echo "find match between $x and $y";
    a="${batches["$x"]}";
    b="${batches["$y"]}";
    if [ -z $a ] && [ -n $b ]
        then echo "$line" >> Output/batch_$b.txt;
    elif [ -n $a ] && [ -z $b ]
        then echo "$line" >> Output/batch_$a.txt;
    elif [ -z $a ] && [ -z $b ]
            then echo "$line" >> Output/batch_0.txt;
    elif [ $a -gt $b ]
        then echo "$line" >> Output/batch_$a.txt;
    elif [ $a -le $b ]
            then echo "$line" >> Output/batch_$b.txt;
    fi

done < input.txt

Answer 1

while IFS= read -r line; do
   x=${line%%$'\t'*}; rest=${line#*$'\t'}
   y=${rest%%$'\t'*}; rest=${rest#*$'\t'}
   ...
done <input.txt

That way you aren't starting two external programs every single time you want to split line into x and y.

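Under normal circumstances you could let read do the string splitting implicitly by reading the columns into separate variables. However, read trims leading whitespace, so that doesn't work correctly if (as here) your columns are whitespace-separated and the first one can be empty; consequently, parameter expansion is necessary. See BashFAQ #73 for details on how parameter expansion works, and BashFAQ #100 for a general introduction to string manipulation with bash's native facilities.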
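To make the difference concrete, here is a minimal sketch comparing the two splitting methods on a line whose first column is empty. The sample line and the variable names (rx, ry, px, py) are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample line: the first column is empty.
line=$'\tsecond\tthird'

# Tab counts as IFS whitespace, so read silently drops the leading tab
# and every column shifts one position to the left.
IFS=$'\t' read -r rx ry _ <<<"$line"

# Parameter expansion cuts at the first tab and keeps the empty column.
px=${line%%$'\t'*}; rest=${line#*$'\t'}
py=${rest%%$'\t'*}

printf 'read:      x=[%s] y=[%s]\n' "$rx" "$ry"
printf 'expansion: x=[%s] y=[%s]\n' "$px" "$py"
```

With read, x ends up holding the value that really belongs to the second column, while the expansion version reports an empty x.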

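Also, re-opening your output files every time you want to write a single line to them is silly at this kind of volume. Either use awk, which will handle this for you automatically, or write a helper (note that the following requires a fairly new bash release, probably 4.2):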

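write_to_file() {
    local filename content new_out_fd
    filename=$1; shift
    # Join the remaining arguments with tabs, then drop the trailing tab.
    printf -v content '%s\t' "$@"
    content=${content%$'\t'}

    # Global map from filename to its already-open file descriptor.
    declare -g -A output_fds
    if ! [[ ${output_fds[$filename]} ]]; then
      # First write to this file: open it once and remember the descriptor.
      exec {new_out_fd}>"$filename"
      output_fds[$filename]=$new_out_fd
    fi
    printf '%s\n' "$content" >&"${output_fds[$filename]}"
}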

...and then:

if [[ $a && ! $b ]]; then
    write_to_file "Output/batch_$a.txt" "$line"
elif [[ ! $a ]] && [[ $b ]]; then
    write_to_file "Output/batch_$b.txt" "$line"
elif [[ ! $a ]] && [[ ! $b ]]; then
    write_to_file "Output/batch_0.txt" "$line"
elif (( a > b )); then
    write_to_file "Output/batch_$a.txt" "$line"
else
    write_to_file "Output/batch_$b.txt" "$line"
fi
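Once the loop has finished, the cached descriptors stay open until the script exits. If the script carries on doing other work, it may be worth closing them explicitly. The sketch below is one way to do that; close_output_fds is a hypothetical companion function, not part of the answer above, and write_to_file is repeated only so the block runs on its own.

```shell
#!/usr/bin/env bash
# Copy of the helper from the answer, so this sketch is self-contained.
write_to_file() {
    local filename content new_out_fd
    filename=$1; shift
    printf -v content '%s\t' "$@"
    content=${content%$'\t'}

    declare -g -A output_fds
    if ! [[ ${output_fds[$filename]} ]]; then
      exec {new_out_fd}>"$filename"
      output_fds[$filename]=$new_out_fd
    fi
    printf '%s\n' "$content" >&"${output_fds[$filename]}"
}

# Hypothetical companion: close every descriptor that write_to_file cached.
close_output_fds() {
    local filename fd
    declare -g -A output_fds          # make sure the map exists and is associative
    for filename in "${!output_fds[@]}"; do
        fd=${output_fds[$filename]}
        exec {fd}>&-                  # close the descriptor whose number is in fd
    done
    output_fds=()                     # forget the now-closed descriptors
}
```

Note that the helper opens each file with > rather than >>, so an existing file is truncated the first time it is written to in a given run.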

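Note that caching FDs only makes sense if you have few enough output files that you can keep a file descriptor open for each of them (and if avoiding the re-open of files that receive more than one write is a net benefit). If it doesn't make sense for you, feel free to leave it out and do only the faster string splitting.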
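For comparison, the awk route mentioned above might look roughly like the following. This is a sketch under assumptions: the question's batches array lives in bash, so here it is assumed to have been dumped to a tab-separated "key, batch number" file first, and the function name split_with_awk and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# split_with_awk MAP INPUT OUTDIR
#   MAP    hypothetical tab-separated file: key<TAB>batch number
#   INPUT  tab-separated input; columns 1 and 2 are looked up in MAP
#   OUTDIR directory that receives the batch_N.txt files
split_with_awk() {
    local map=$1 input=$2 outdir=$3
    mkdir -p "$outdir"
    awk -F'\t' -v outdir="$outdir" '
        NR == FNR { batch[$1] = $2; next }        # first file: load the map
        {
            a = ($1 in batch) ? batch[$1] : ""
            b = ($2 in batch) ? batch[$2] : ""
            if (a == "" && b == "")  n = 0        # neither key is known
            else if (a == "")        n = b
            else if (b == "")        n = a
            else if (a + 0 > b + 0)  n = a        # numeric comparison
            else                     n = b
            # awk keeps each output file open after the first write to it
            print > (outdir "/batch_" n ".txt")
        }
    ' "$map" "$input"
}
```

A call such as split_with_awk batches.tsv input.txt Output runs the whole job in a single process, with no per-line fork and no per-line re-open.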

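Just for completeness, here's another approach (also written using automatic fd management, thus requiring bash 4.2): run two cut invocations and let each of them run through the entire input file.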

exec {x_columns_fd}< <(cut -d$'\t' -f1 <input.txt)
exec {y_columns_fd}< <(cut -d$'\t' -f2 <input.txt)
while IFS='' read -r line && \
      IFS='' read -r -u "$x_columns_fd" x && \
      IFS='' read -r -u "$y_columns_fd" y; do
  ...
done <input.txt

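This works because it isn't cut itself that's inefficient; the cost comes from starting it up, running it, reading its output, and shutting it down over and over. If you run just two copies of cut and let each of them process the whole file, performance is fine.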
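To see the start-up cost on your own machine, a rough comparison along these lines may help. The function names and the generated sample data are hypothetical, and the timings will vary from system to system.

```shell
#!/usr/bin/env bash
# Column 1 via one cut process per line (the slow pattern from the question).
first_col_forking() {
    local line
    while IFS= read -r line; do
        printf '%s\n' "$line" | cut -f1     # cut splits on tab by default
    done
}

# Column 1 via a single cut process for the whole stream.
first_col_single() {
    cut -f1
}

# Hypothetical sample data: 500 tab-separated lines.
sample=$(mktemp)
for ((i = 0; i < 500; i++)); do
    printf 'k%d\tv%d\n' "$i" "$i"
done >"$sample"

time first_col_forking <"$sample" >/dev/null
time first_col_single  <"$sample" >/dev/null
```

Both functions produce identical output; only the number of processes started differs.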

Optimizing the performance of an IO-heavy shell script
