Question

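Context

I need to optimize deduplication using 'sort -u'. My Linux machine has an old implementation of the 'sort' command (i.e. 5.97) that has no '--parallel' option. Although 'sort' implements parallelizable algorithms (e.g. merge sort), I need to make that parallelization explicit. I therefore handcrafted it via the 'xargs' command, which outperforms the single 'sort -u' approach by about 2.5X ... when it works properly.

Here is the intuition behind what I am doing.

I am running a bash script that splits an input file (e.g. file.txt) into several parts (e.g. file.txt.part1, file.txt.part2, file.txt.part3, file.txt.part4). The resulting parts are passed to the 'xargs' command, which performs parallel deduplication through the sortu.sh script (details at the end). sortu.sh wraps the invocation of 'sort -u' and prints the resulting file name (e.g. "sortu.sh file.txt.part1" outputs "file.txt.part1.sorted"). The resulting sorted parts are then passed to 'sort --merge -u', which merges and deduplicates the input parts on the assumption that each part is already sorted.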
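
The split, per-part 'sort -u' and 'sort --merge -u' steps described above can be sketched sequentially, without any parallelism, to check that the merged result matches a single 'sort -u'. This is only an illustration: the scratch directory, the sample words and the part names produced by 'split' (file.txt.part.aa, ...) are assumptions, not the names used by the real script.

```shell
#!/bin/bash
# Sequential sketch of the split / sort -u / merge pipeline (no parallelism).
workdir=$(mktemp -d)                      # scratch directory (assumption)
printf '%s\n' pear apple fig apple kiwi pear fig date > "$workdir/file.txt"

# 1. Split the input into 3-line parts: file.txt.part.aa, file.txt.part.ab, ...
split -l 3 "$workdir/file.txt" "$workdir/file.txt.part."

# 2. Sort and deduplicate each part on its own.
for part in "$workdir"/file.txt.part.??; do
    sort -u "$part" > "$part.sorted"
done

# 3. Merge the already-sorted parts, dropping duplicates across parts.
#    -m is the short form of --merge.
merged=$(sort -m -u "$workdir"/file.txt.part.??.sorted)

# Reference result from a single sort -u over the whole file.
reference=$(sort -u "$workdir/file.txt")

rm -rf "$workdir"
printf '%s\n' "$merged"
```

Step 2 is the only step that the 'xargs' invocation is meant to parallelize; steps 1 and 3 stay as they are.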

The problem I am running into is with the parallelization via 'xargs'. Here is a simplified version of my code:

 AVAILABLE_CORES=4
 PARTS="file.txt.part1
 file.txt.part2
 file.txt.part3
 file.txt.part4"

 SORTED_PARTS=$(echo "$PARTS" | xargs --max-args=1 \
                                      --max-procs=$AVAILABLE_CORES \
                                      bash sortu.sh \
               )
 ...
 #More code for merging the resulting parts $SORTED_PARTS
 ...

The expected result is a list of the sorted parts in the variable SORTED_PARTS:

 echo "$SORTED_PARTS"
 file.txt.part1.sorted
 file.txt.part2.sorted
 file.txt.part3.sorted
 file.txt.part4.sorted

Symptom

However, (sometimes) one of the sorted parts is missing. For instance, file.txt.part2.sorted:

 echo "$SORTED_PARTS"
 file.txt.part1.sorted
 file.txt.part3.sorted
 file.txt.part4.sorted

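This symptom is non-deterministic both in its occurrence (i.e. one execution on the same file.txt succeeds and another one fails) and in the file that goes missing (i.e. it is not always the same sorted part).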
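
Because the failure is intermittent, it can help to detect it explicitly rather than notice it later in the merge. The following is a minimal sketch, assuming the part names from the example above: it derives the expected names from the input list and uses 'comm' to report any name absent from the captured output. The SORTED_PARTS value is hard-coded here to reproduce the symptom.

```shell
# Expected names, derived from the parts that were handed to xargs.
PARTS="file.txt.part1
file.txt.part2
file.txt.part3
file.txt.part4"
expected=$(printf '%s\n' "$PARTS" | sed 's/$/.sorted/' | sort)

# Example of an incomplete result, as in the symptom above.
SORTED_PARTS="file.txt.part1.sorted
file.txt.part3.sorted
file.txt.part4.sorted"

# comm needs two sorted inputs, so write both lists to scratch files.
tmp=$(mktemp -d)
printf '%s\n' "$expected" > "$tmp/expected"
printf '%s\n' "$SORTED_PARTS" | sort > "$tmp/actual"
missing=$(comm -23 "$tmp/expected" "$tmp/actual")   # lines only in "expected"
rm -rf "$tmp"

if [ -n "$missing" ]; then
    echo "missing sorted parts: $missing" >&2
fi
```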

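Problem

I suspect a race condition in which all the sortu.sh instances have finished and 'xargs' sends EOF before stdout has been flushed.

Question

Is there a way to ensure that stdout is flushed before 'xargs' sends EOF?

Constraints

I can use neither the parallel command nor the '--parallel' option of the sort command.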
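
Within these constraints, one possible workaround is to stop depending on the workers' stdout altogether. The sketch below does not explain or fix the suspected race; it only sidesteps it by rebuilding the list of sorted names from the known input names once 'xargs' has returned. The script name sortu_quiet.sh, the scratch directory and the demo data are all hypothetical.

```shell
workdir=$(mktemp -d)                       # scratch directory (assumption)
cd "$workdir" || exit 1

# Hypothetical variant of sortu.sh that prints nothing on stdout.
cat > sortu_quiet.sh <<'EOF'
#!/bin/bash
sort -u "$1" > "$1.sorted"
EOF

# Four small demo parts standing in for file.txt.part1..4.
for i in 1 2 3 4; do
    printf '%s\n' "b$i" a "b$i" > "file.txt.part$i"
done
PARTS=$(printf '%s\n' file.txt.part1 file.txt.part2 file.txt.part3 file.txt.part4)
AVAILABLE_CORES=4

# -n 1 and -P are the short forms of --max-args=1 and --max-procs.
# xargs returns only after every child has exited; a non-zero status
# means at least one worker failed.
if ! printf '%s\n' "$PARTS" | xargs -n 1 -P "$AVAILABLE_CORES" bash sortu_quiet.sh; then
    echo "a sort worker failed" >&2
    exit 1
fi

# Build the list from the known input names instead of reading stdout.
SORTED_PARTS=$(printf '%s\n' "$PARTS" | sed 's/$/.sorted/')

# The names contain no whitespace here, so unquoted word splitting is safe.
RESULT=$(sort -m -u $SORTED_PARTS)

cd - > /dev/null
rm -rf "$workdir"
```

The trade-off is that the caller has to know the naming rule (append ".sorted") that the worker script uses.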

sortu.sh code

 #!/bin/bash

 SORTED=$1.sorted
 sort -u "$1" > "$SORTED"
 echo "$SORTED"

Answer 1

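The following does not write any content to disk at all, and it parallelizes the splitting process, the sort processes and the merge, running all of them at the same time.

This version has been backported to bash 3.2; a version built for newer releases of bash would not need eval.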

#!/bin/bash

nprocs=5  # maybe call the nproc command instead?
fd_min=10 # on bash 4.1, can use automatic FD allocation instead

# create a temporary directory; delete on exit
tempdir=$(mktemp -d "${TMPDIR:-/tmp}/psort.XXXXXX")
trap 'rm -rf "$tempdir"' 0

# close extra FDs and clear traps, before optionally executing another tool.
#
# Doing this in subshells ensures that only the main process holds write handles on the
# individual sorts, so that they exit when those handles are closed.
cloexec() {
    local fifo_fd
    for ((fifo_fd=fd_min; fifo_fd < (fd_min+nprocs); fifo_fd++)); do
        : "Closing fd $fifo_fd"
        # in modern bash; just: exec {fifo_fd}>&-
        eval "exec ${fifo_fd}>&-"
    done
    if (( $# )); then
        trap - 0
        exec "$@"
    fi
}

# For each parallel process:
# - Run a sort -u invocation reading from an FD and writing to a FIFO
# - Add the FIFO's name to a merge sort command
merge_cmd=(sort --merge -u)
for ((i=0; i<nprocs; i++)); do
  mkfifo "$tempdir/fifo.$i"               # create FIFO
  merge_cmd+=( "$tempdir/fifo.$i" )       # add to sort command line
  fifo_fd=$((fd_min+i))
  : "Opening FD $fifo_fd for sort to $tempdir/fifo.$i"
  # in modern bash: exec {fifo_fd}> >(cloexec sort -u >"$tempdir/fifo.$i")
  printf -v exec_str 'exec %q> >(cloexec; exec sort -u >%q)' "$fifo_fd" "$tempdir/fifo.$i"
  eval "$exec_str"
done

# Run the big merge sort recombining output from all the FIFOs
cloexec "${merge_cmd[@]}" &
merge_pid=$!

# Split input stream out to all the individual sort processes...
awk -v "nprocs=$nprocs" \
    -v "fd_min=$fd_min" \
  '{ print $0 >("/dev/fd/" (fd_min + (NR % nprocs))) }'

# ...when done, close handles on the FIFOs, so their sort invocations exit
cloexec

# ...and wait for the merge sort to exit
wait "$merge_pid"
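
The awk stage in the script above distributes input lines round-robin: line number NR goes to the file descriptor at offset NR % nprocs. To make that rule visible, here is a small sketch that applies the same rule but writes to regular files instead of FIFO-backed descriptors. The bucket file names, the scratch directory and the sample lines are assumptions made for the demonstration.

```shell
nprocs=3                                   # number of buckets (demo value)
outdir=$(mktemp -d)                        # scratch directory (assumption)

# Line NR goes to bucket NR % nprocs, the same rule as the awk stage above,
# but writing to regular files so the result can be inspected afterwards.
printf '%s\n' l1 l2 l3 l4 l5 l6 l7 |
    awk -v "nprocs=$nprocs" -v "dir=$outdir" \
        '{ print $0 > (dir "/bucket." (NR % nprocs)) }'

bucket0=$(cat "$outdir/bucket.0")          # lines 3 and 6
bucket1=$(cat "$outdir/bucket.1")          # lines 1, 4 and 7
bucket2=$(cat "$outdir/bucket.2")          # lines 2 and 5
rm -rf "$outdir"
```

The script itself reads its input on stdin and writes the merged result to stdout, so, assuming it is saved under the hypothetical name psort.sh, it would be invoked as: bash psort.sh < file.txt > file.dedup.txt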

Explicit sort parallelization via xargs - Incomplete results from xargs --max-procs
