Question

I have a pipe-delimited (|) input file (TableInfo.txt) with data like the following:

dbName1|Table1
dbName1|Table2
dbName2|Table3
dbName2|Table4
...

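I have a shell script (LoadTables.sh) that parses each line and calls an executable, passing it arguments taken from the line, such as dbName and TableName. This process reads data from SQL Server and loads it into HDFS.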

while IFS= read -r line;do
    fields=($(printf "%s" "$line"|cut -d'|' --output-delimiter=' ' -f1-))
    query=$(< ../sqoop/"${fields[1]}".sql)
    sh ../ProcessName "${fields[0]}" "${fields[1]}" "$query"
done < ../TableInfo.txt

Right now my process runs sequentially for each line in the file, so the total time grows with the number of entries in the file.

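Is there any way to run this process in parallel? I have heard of using xargs, GNU parallel, or ampersand with wait, but I am not familiar with how to construct and use them. Any help is appreciated.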

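Note: I don't have GNU parallel installed on the Linux machine. So xargs is the only option, since I have heard there are some drawbacks to using ampersand and wait.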

Answer 1

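Put & at the end of any line you want to move to the background. Replacing the silly (buggy) array-splitting method used in your code with read's own field splitting, that looks something like: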

while IFS='|' read -r db table; do
    ../ProcessName "$db" "$table" "$(<"../sqoop/${table}.sql")" &
done < ../TableInfo.txt
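
The loop above starts every job at once and returns without waiting for any of them. A minimal sketch of one way to cap the number of concurrent jobs and block until all have finished is shown here. It assumes bash 4.3 or later (for wait -n); process_table and run_all are hypothetical names, with process_table standing in for ../ProcessName.

```shell
#!/usr/bin/env bash
# Sketch only: background each job with &, cap concurrency, then wait for all.

# Hypothetical stand-in for ../ProcessName; replace with the real command.
process_table() {
    printf 'loaded %s.%s\n' "$1" "$2"
}

# run_all FILE [MAX_JOBS]: run one job per "db|table" line, MAX_JOBS at a time.
run_all() {
    local input=$1 max_jobs=${2:-4} running=0 db table
    while IFS='|' read -r db table; do
        if [ "$running" -ge "$max_jobs" ]; then
            wait -n                      # block until any one job exits (bash 4.3+)
            running=$((running - 1))
        fi
        process_table "$db" "$table" &
        running=$((running + 1))
    done < "$input"
    wait                                 # do not return until every job has finished
}
```

Calling it as run_all ../TableInfo.txt 4 would run at most four jobs at a time. Output lines from different jobs can arrive in any order.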

...FYI, regarding what I mean by "buggy":

fields=( $(foo) )

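...performs not only string splitting but also globbing on the output of foo. Thus, a * in the output is replaced with a list of filenames in the current directory; a name like foo[bar] can be replaced with a file named foob, fooa, or foor; the failglob shell option can cause such an expansion to result in a failure; the nullglob shell option can cause it to result in an empty result; and so on.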
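
The globbing problem can be reproduced with a small sketch like the one below. The temporary directory, the files fooa and foob, and the sample line are assumptions made up for the demonstration; tr is used in place of the original cut pipeline only to keep the example short.

```shell
#!/usr/bin/env bash
# Sketch only: unquoted $(...) in an array assignment is word-split AND glob-expanded.

demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch fooa foob                  # files that the pattern foo[bar] happens to match

line='dbName1|foo[bar]'

# Buggy: after splitting, foo[bar] is treated as a glob and replaced by fooa foob.
buggy=( $(printf '%s' "$line" | tr '|' ' ') )

# Safer: read splits on IFS and never performs pathname expansion.
IFS='|' read -r db table <<< "$line"

printf 'array split: %d fields: %s\n' "${#buggy[@]}" "${buggy[*]}"
printf 'read split:  db=%s table=%s\n' "$db" "$table"

cd "$OLDPWD" || exit 1
rm -rf "$demo_dir"
```

With the two matching files present, the array ends up with three fields instead of two, while read keeps foo[bar] intact.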

If you have GNU xargs, consider the following:

# assuming you have "nproc" to get the number of CPUs; otherwise, hardcode
xargs -P "$(nproc)" -d $'\n' -n 1 bash -c '
  db=${1%|*}; table=${1##*|}
  query=$(<"../sqoop/${table}.sql")
  exec ../ProcessName "$db" "$table" "$query"
  ' _ < ../TableInfo.txt
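
To see what the two parameter expansions do and how a failed job surfaces, the same xargs pattern can be tried with a stand-in command. This is a sketch under assumptions: run_parallel is a hypothetical wrapper, the BadTable failure rule is invented for illustration, and GNU xargs is required for -d.

```shell
#!/usr/bin/env bash
# Sketch only: the xargs pattern with a stand-in instead of ../ProcessName.
# Reads "db|table" lines on stdin; -P 2 runs at most two jobs at a time.
run_parallel() {
    xargs -P 2 -d '\n' -n 1 bash -c '
        db=${1%|*}       # remove the shortest suffix matching "|*": text before the last |
        table=${1##*|}   # remove the longest prefix matching "*|": text after the last |
        # Hypothetical failure rule, for illustration only.
        [ "$table" = "BadTable" ] && exit 1
        printf "%s %s\n" "$db" "$table"
    ' _
}
```

Usage would look like run_parallel < ../TableInfo.txt. GNU xargs exits with status 123 if any invocation exits with a status between 1 and 125, so the caller can detect that at least one table failed, though not which one.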

Run a shell script in parallel for each line in a file
