Given a file containing the list of tables to Sqoop, the script launches a Sqoop import command with a list of options. The idea here is the "dispatcher", which I borrowed from here: I want the script to launch no more than a maximum number of child processes (defined in a variable), monitor them, and as soon as one of them finishes, launch another one to keep the queue full. It does this until there are no more tables to Sqoop.
The script and the dispatcher work correctly, but the script ends before the subshells have finished their jobs.
I tried putting wait at the end of the script,
but then it waits for me to press ENTER.
I can't disclose the full script, sorry. Hopefully you can understand it anyway.
Thanks for your help.
#!/bin/bash
# Script to parallel offloading RDB tables to Hive via Sqoop
confFile=$1
listOfTables=$2
# Source configuration values
. "$confFile"
# This file contains various configuration options, among them "parallels",
# which is the number of concurrent jobs I want to launch
# Some nice functions.
usage () {
    ...
}
doSqoop() {
    # This function launches a Sqoop command built from the information
    # extracted in the while loop. It also writes 2 log files and checks the Sqoop RC.
    ...
}
queue() {
    queue="$queue $1"
    num=$(($num+1))
}
regeneratequeue() {
    oldrequeue=$queue
    queue=""
    num=0
    for PID in $oldrequeue; do
        if [ -d /proc/"$PID" ]; then
            queue="$queue $PID"
            num=$(($num+1))
        fi
    done
}
checkqueue() {
    oldchqueue=$queue
    for PID in $oldchqueue; do
        if [ ! -d /proc/"$PID" ]; then
            regeneratequeue # at least one PID has finished
            break
        fi
    done
}
# Check for mandatory values.
...
#### HeavyLifting ####
# Since I have a file containing the list of tables to Sqoop along with other
# useful arguments like sourceDB, sourceTable, hiveDB, HiveTable, number of parallels,
# etc, all in the same line, I use awk to grab them and then
# I pass them to the function doSqoop().
# So, here I:
# 1. create a temp folder
# 2. grab values from line with awk
# 3. launch doSqoop() as below:
# 4. Monitor spawned jobs
awk '!/^($|#)/' < "$listOfTables" | { while read -r line; do
    # look for the folder or create it
    # .....
    # extract values from the line with awk
    # ....
    # launch doSqoop() with this line:
    (doSqoop) &
    PID=$!
    queue $PID
    while [[ "$num" -ge "$parallels" ]]; do
        checkqueue
        sleep 0.5
    done
done; }
# Here I tried to put wait, without success.
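As a side note on the dispatcher pattern itself: a bare `wait` with no arguments at the very end of the script does block until every child exits (the apparent "waits for ENTER" behavior is usually just late job output interleaving with the interactive shell prompt). A minimal, hedged sketch of the throttle-then-wait pattern — assuming bash >= 4.3 for `wait -n`, with hypothetical table names standing in for the real work:

```shell
#!/bin/bash
# Throttle background jobs to $parallels at a time, then wait for all of them.
parallels=2

run_jobs() {
    local t
    for t in tableA tableB tableC tableD; do   # hypothetical table names
        ( sleep 0.2; echo "done $t" ) &        # stand-in for (doSqoop) &
        # Throttle: wait -n (bash >= 4.3) returns as soon as ONE job exits,
        # instead of polling /proc every half second.
        while (( $(jobs -rp | wc -l) >= parallels )); do
            wait -n
        done
    done
    wait    # block here until every remaining child has exited
    echo "all jobs complete"
}

run_jobs
```

Here `wait -n` replaces the polling checkqueue/regeneratequeue loop: the shell sleeps until any one job finishes rather than re-scanning the PID list on a timer.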
OK, so I managed to implement what DeeBee suggested, which as far as I can tell is correct. I did not implement what Duffy said, because I don't understand it well enough and I don't have time ATM.
The problem now is that I moved some code into the doSqoop function, and it cannot create the /tmp folder needed for the logs.
I can't see what is wrong. Here is the code, followed by the error.
Please consider that the query parameter is very long and contains spaces.
#!/bin/bash
# Script to download lot of tables in parallel with Sqoop and write them to Hive
confFile=$1
listOfTables=$2
# Source configuration values
. "$confFile"
# TODO: delete sqoop tmp directory after jobs ends #
doSqoop() {
    local origSchema="$1"
    local origTable="$2"
    local hiveSchema="$3"
    local hiveTable="$4"
    local splitColumn="$5"
    local sqoopParallels="$6"
    local query="$7"
    # databaseBaseDir must be set before the log paths that use it
    databaseBaseDir="$baseDir"/"$origSchema"-"$hiveSchema"
    local logFileSummary="$databaseBaseDir"/"$hiveTable"-summary.log
    local logFileRaw="$databaseBaseDir"/"$hiveTable"-raw.log
    [ -d "$databaseBaseDir" ] || mkdir -p "$databaseBaseDir"
    if [[ $? -ne 0 ]]; then
        echo -e "Unable to complete the process.\nCannot create logs folder $databaseBaseDir"
        exit 1
    fi
    echo "#### [$(date +%Y-%m-%dT%T)] Creating Hive table $hiveSchema.$hiveTable from source table $origSchema.$origTable ####" | tee -a "$logFileSummary" "$logFileRaw"
    echo -e "\n\n"
    quote="'"
    sqoop import -Dmapred.job.queuename="$yarnQueue" -Dmapred.job.name="$jobName" \
        --connect "$origServer" \
        --username SQOOP --password-file file:///"$passwordFile" \
        --delete-target-dir \
        --target-dir "$targetTmpHdfsDir"/"$hiveTable" \
        --outdir "$dirJavaCode" \
        --hive-import \
        --hive-database "$hiveSchema" \
        --hive-table "$hiveTable" \
        --hive-partition-key "$hivePartitionName" --hive-partition-value "$hivePartitionValue" \
        --query "$quote $query where \$CONDITIONS $quote" \
        --null-string '' --null-non-string '' \
        --num-mappers 1 \
        --fetch-size 2000000 \
        --as-textfile \
        -z --compression-codec org.apache.hadoop.io.compress.SnappyCodec |& tee -a "$logFileRaw"
    sqoopRc=${PIPESTATUS[0]}   # plain $? here would be the exit status of tee
    if [[ $sqoopRc -ne 0 ]]; then
        echo "[$(date +%Y-%m-%dT%T)] Error importing $hiveSchema.$hiveTable !" | tee -a "$logFileSummary" "$logFileRaw"
        echo "$hiveSchema.$hiveTable" >> "$databaseBaseDir"/failed_imports.txt
    fi
    echo "Tail of : $logFileRaw" >> "$logFileSummary"
    tail -10 "$logFileRaw" >> "$logFileSummary"
}
export -f doSqoop
# Check for mandatory values.
if [[ ! -f "$confFile" ]]; then
    echo -e " $confFile does not appear to be a valid file.\n"
    usage
fi
if [[ ! -f "$listOfTables" ]]; then
    echo -e " $listOfTables does not appear to be a valid file.\n"
    usage
fi
if [[ -z "${username+x}" ]]; then
    echo -e " A valid username is required to access the Source.\n"
    usage
fi
if [[ ! -f "$passwordFile" ]]; then
    echo -e " Password File $passwordFile does not appear to be a valid file.\n"
    usage
fi
if [[ -z "${origServer+x}" ]]; then
    echo -e " Sqoop connection string is required.\n"
    usage
fi
#### HeavyLifting ####
awk -F"|" '!/^($|#)/ {print $1 $2 $3 $4 $5 $6 $7}' < "$listOfTables" | xargs -n7 -P$parallels bash -c "doSqoop {}"
mkdir: cannot create directory `/{}-'mkdir: : Permission deniedcannot create directory `/{}-'
mkdir: : Permission denied
cannot create directory `/{}-': Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
Unable to complete the process.
Cannot create logs folder /{}-
Unable to complete the process.
Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
mkdir: mkdir: cannot create directory `/{}-'cannot create directory `/{}-': Permission denied: Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
Unable to complete the process.
Cannot create logs folder /{}-
Unable to complete the process.
Cannot create logs folder /{}-
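The `/{}-` paths in the errors above give the cause away: in `bash -c "doSqoop {}"` the `{}` is never replaced (no `-I` was given), so the function receives the literal string `{}`, and plain variables like `$baseDir` are not exported to the child bash either, leaving the path as `/{}-`. A common way to forward xargs-produced arguments into an exported function is `bash -c '… "$@"' _`. A sketch with a stand-in function (doWork is hypothetical, standing in for doSqoop):

```shell
#!/bin/bash
# Forward xargs-produced arguments into an exported shell function.
doWork() {
    echo "got: $1 | $2 | $3"     # stand-in for doSqoop
}
export -f doWork

# -n3 gives three args per call; "$@" inside the -c string forwards them;
# the trailing _ fills $0 of the inner bash so args start at $1.
printf '%s\n' a b c d e f | xargs -n3 bash -c 'doWork "$@"' _
# prints:
#   got: a | b | c
#   got: d | e | f
```

Any plain shell variables the function relies on must also be `export`ed, since each invocation runs in a fresh bash process.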
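A separate subtlety in the script above: the Sqoop exit status is captured after piping through `tee`, but after a pipeline `$?` is the exit status of the last stage (tee), so a failing sqoop would go unnoticed; bash's `${PIPESTATUS[0]}` preserves the first stage's status. A stand-alone illustration, with `false` as a stand-in for a failing sqoop invocation:

```shell
#!/bin/bash
# After a pipeline, $? is the status of the LAST stage (tee here),
# so a failure in the first command would be masked.
false | tee /dev/null       # stand-in for: sqoop ... |& tee -a "$logFileRaw"
rc=${PIPESTATUS[0]}         # read immediately; any later command resets it
echo "first stage exited with $rc"   # prints: first stage exited with 1
```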
Answer 0 (score: 0)
Since you push doSqoop into a background job with &, the only things limiting your script's execution time are sleep 0.5 and however long checkqueue takes to run.
Have you considered using xargs to run the function in parallel?
An example that I think approximates your use case:
$ cat sqoop.bash
#!/bin/bash
doSqoop(){
local arg="${1}"
sleep $(shuf -i 1-10 -n 1) # random between 1 and 10 seconds
echo -e "${arg}\t$(date +'%H:%M:%S')"
}
export -f doSqoop # so xargs can use it
threads=$(nproc) # number of cpu cores
awk '{print}' < tables.list | xargs -n1 -P${threads} -I {} bash -c "doSqoop {}"
$ seq 1 15 > tables.list
Result:
$ ./sqoop.bash
3 11:29:14
4 11:29:14
8 11:29:14
9 11:29:15
11 11:29:15
1 11:29:20
2 11:29:20
6 11:29:21
14 11:29:22
7 11:29:23
5 11:29:23
13 11:29:23
15 11:29:24
10 11:29:24
12 11:29:24
Sometimes it is nice to let xargs do the work for you.
Edit:
An example passing 3 args to the function, with a maximum of 8 parallel operations:
$ cat sqoop.bash
#!/bin/bash
doSqoop(){
a="${1}"; b="${2}"; c="${3}"
sleep $(shuf -i 1-10 -n 1) # do some work
echo -e "$(date +'%H:%M:%S') $a $b $c"
}
export -f doSqoop
awk '{print $1,$3,$5}' tables.list | xargs -n3 -P8 -I {} bash -c "doSqoop {}"
$ cat tables.list
1a 1b 1c 1d 1e
2a 2b 2c 2d 2e
3a 3b 3c 3d 3e
4a 4b 4c 4d 4e
5a 5b 5c 5d 5e
6a 6b 6c 6d 6e
7a 7b 7c 7d 7e
$ ./sqoop.bash
09:46:57 1a 1c 1e
09:46:57 7a 7c 7e
09:47:05 3a 3c 3e
09:47:06 4a 4c 4e
09:47:06 2a 2c 2e
09:47:09 5a 5c 5e
09:47:09 6a 6c 6e
Answer 1 (score: 0)
With GNU Parallel you can do it like this:
export -f doSqoop
grep -Ev '^#' "$listOfTables" |
parallel -r --colsep '\|' -P$parallels doSqoop {}
If you only want one process per CPU core:
... | parallel -r --colsep '\|' doSqoop {}
Answer 2 (score: 0)
After a while I now have the time to answer my own question, since I really don't want anyone else to get stuck on this kind of problem.
I ran into more than one issue, related both to bugs in my code and to my use of xargs. In hindsight, based on my experience, I would definitely suggest NOT using xargs for this kind of thing. Bash is not the best-suited language for it, but if you are forced to use it, consider GNU Parallel instead. I will move my script to it soon.
Regarding the issues: to feed whole lines to the function I used -l1 -I args. This way xargs treats each line as a single argument and passes it to the function (where I parse the fields with awk).
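A hedged sketch of the `-l1 -I args` approach described above (GNU xargs; `-I` already implies one line per invocation, and the field names and separator below are hypothetical): each whole line is substituted for `args` inside the bash -c string, and the function splits the fields itself.

```shell
#!/bin/bash
# One whole line per call: -I args substitutes the line verbatim into
# the command string; the function then splits the pipe-separated fields.
doLine() {
    IFS='|' read -r schema table hivedb <<< "$1"   # hypothetical field names
    echo "schema=$schema table=$table hivedb=$hivedb"
}
export -f doLine

printf '%s\n' 'db1|t1|h1' 'db2|t2|h2' |
    xargs -I args -P2 bash -c 'doLine "args"'
```

Note that substituting raw input into a bash -c string only works for trusted input: a line containing quotes or `$` would be interpreted by the inner shell, which is one more reason GNU Parallel's `--colsep` is the safer tool here.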