Question

I am trying to run a script inside a function and then call that function:

filedetails ()
{
   # read TOTAL_DU < "/tmp/sizes.out";
    disksize=`du -s "$1" | awk '{print $1}'`;
    let TOTAL_DU+=$disksize;
    echo "$TOTAL_DU";
   # echo $TOTAL_DU > "/tmp/sizes.out"
}

I use the variable TOTAL_DU as a counter to keep a running total of the du output for all the files.

I run it with parallel or xargs:

find . -type f | parallel -j 8 filedetails

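But TOTAL_DU is reset on every call and does not keep the count, which is expected because a new shell is used each time. I also tried writing the counter to a file and reading it back, but because the parallel jobs finish at different speeds, the reads and writes do not happen in order (again, as expected), so that does not work either. The question is: is there a way to keep a count when using parallel or xargs?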
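The reset described above can be reproduced without parallel at all. The following is a minimal sketch, assuming only a POSIX sh; each `sh -c` stands in for one job started by parallel or xargs, and the increments 10 and 20 are arbitrary illustrative values.

```shell
# The parent exports a counter that starts at zero.
TOTAL_DU=0
export TOTAL_DU

# Each `sh -c` is a separate child process, much like one job started by
# parallel or xargs. It receives a copy of TOTAL_DU and changes only that copy.
first=$(sh -c 'TOTAL_DU=$((TOTAL_DU + 10)); echo "$TOTAL_DU"')
second=$(sh -c 'TOTAL_DU=$((TOTAL_DU + 20)); echo "$TOTAL_DU"')

echo "first child saw:  $first"
echo "second child saw: $second"
echo "parent still has: $TOTAL_DU"
```

The first child reports 10, the second reports 20 (not 30), and the parent still holds 0: every child starts from the exported value and nothing flows back.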

Answer 1

Except for learning purposes, this is unlikely to be a good use of parallel, because:

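1. Calling du like this will quite possibly be slower than invoking du in the normal way. First, the information about file sizes can be extracted from the directory, so an entire directory can be computed in a single access. In effect, a directory is stored as a special kind of file object whose data is a vector of directory entries ("dirents"), which contain the name and metadata of each file. What you are doing is using find to print these dirents and then getting du to parse each one (every file, not every directory); almost all of this second scan is redundant work.
2. Insisting that du examine every file separately prevents it from avoiding double-counting multiple hard links to the same file, so you can easily end up inflating the disk usage this way. On the other hand, directories also take up disk space, and du normally includes that space in its report. But you never call it on any directory, so you will end up understating the total disk usage.
3. You are invoking a shell and an instance of du for every file. Normally you would create just one process for a single du. Process creation is much slower than reading a file size from a directory. At a minimum, you should use parallel -X and rewrite the shell function to invoke du on all of its arguments rather than just $1.
4. There is no way to share environment variables between sibling shells. So you would have to accumulate the results in a persistent store, such as a temporary file or a database table. That is also an expensive operation, but if you adopt the suggestion above, you only need to do it once per invocation of du rather than once per file.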
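The hard-link point can be checked with a small experiment. This is a sketch under stated assumptions: the scratch directory and the file names `data` and `alias` are made up for the demonstration, and `du -c` is assumed to be available (it is in GNU and BSD du, but not required by POSIX). The exact numbers depend on the filesystem, so none are promised here.

```shell
# Scratch directory (hypothetical layout, created only for this demonstration).
demo=$(mktemp -d)

# One 1 MiB file plus a second name (a hard link) for the same data.
dd if=/dev/zero of="$demo/data" bs=1024 count=1024 2>/dev/null
ln "$demo/data" "$demo/alias"

# One du process per name: each process counts the data on its own.
separate=0
for f in "$demo/data" "$demo/alias"; do
    kb=$(du -s -k "$f" | awk '{print $1}')
    separate=$((separate + kb))
done

# A single du process sees both names and counts the shared data once.
together=$(du -c -s -k "$demo/data" "$demo/alias" | tail -n 1 | awk '{print $1}')

echo "one du per name: ${separate} KiB"
echo "single du call:  ${together} KiB"

rm -rf "$demo"
```

On a typical filesystem the first figure should come out at roughly twice the second, which is the inflation described in point 2; on filesystems that compress or delay allocation the two may be equal.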

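So, ignoring the first two problems and looking only at the last two, and purely for teaching purposes, you could do something like the following: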

# Create a temporary file to store results
tmpfile=$(mktemp)
# Function which invokes du and safely appends its summary line
# to the temporary file
collectsizes() {
  # Get the name of the temporary file, and remove it from the args
  tmpfile=$1
  shift
  # Call du on all the parameters, and get the last (grand total) line
  size=$(du -c -s "$@" | tail -n1)
  # lock the temporary file and append the dataline under lock
  flock "$tmpfile" bash -c 'printf "%s\n" "$1" >> "$2"' _ "$size" "$tmpfile"
}
export -f collectsizes

# Find all regular files, and feed them to parallel taking care
# to avoid problems if files have whitespace in their names.
# -X passes many file names to each collectsizes call.
find . -type f -print0 | parallel -0 -X -j8 collectsizes "$tmpfile"
# When all that's done, sum up the values in the temporary file
awk '{s+=$1}END{print s}' "$tmpfile"
# And delete it.
rm "$tmpfile"
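A variant that avoids the shared file and the lock altogether is to let every worker print its partial total and add the totals up downstream. This is only a sketch and not part of the approach above: it swaps in xargs for parallel, the function name `total_usage` is hypothetical, and it assumes a find with `-print0` and an xargs with `-0` and `-P` (both present in the GNU and BSD tools). The batch size of 64 and the 8 workers are arbitrary choices.

```shell
# Hypothetical helper: print the summed disk usage (in KiB) of all regular
# files under the directory given as $1 (default: current directory).
total_usage() {
    find "${1:-.}" -type f -print0 |
        # Up to 8 workers, at most 64 files each. Every worker prints only
        # the grand-total line of its own du run, one short line per batch.
        xargs -0 -n 64 -P 8 sh -c '[ "$#" -gt 0 ] && du -c -s -k "$@" | tail -n 1' _ |
        # Add up the partial totals as they arrive; order does not matter.
        awk '{ s += $1 } END { print s + 0 }'
}
```

Calling `total_usage /some/dir` prints a single number. Because addition does not depend on the order in which the workers finish, no counter has to be shared between the shells. The caveats about hard links and directory sizes from the list above still apply.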
