Given a file containing the list of tables to Sqoop, the script launches a Sqoop import command with a list of options. The idea here is the "dispatcher", which I borrowed from here: I want the script to launch no more than a maximum number of child processes (defined in a variable), monitor them, and as soon as one of them finishes, launch another one to keep the queue full. It does this until there are no more tables to Sqoop.
The script and the dispatcher work correctly, but the script ends before the subshells have finished their jobs.
I tried putting wait at the end of the script,
but then it waits for me to press ENTER.
I can't disclose the full script, sorry. Hopefully you can understand it anyway.
Thanks for your help.
#!/bin/bash
# Script to parallel offloading RDB tables to Hive via Sqoop
confFile=$1
listOfTables=$2
# Source configuration values
. "$confFile"
# This file contains various configuration options, among them "parallels",
# which is the number of concurrent jobs I want to launch
# Some nice functions.
usage () {
    ...
}
doSqoop() {
    # This function launches a Sqoop command built from the information
    # extracted in the while loop. It also writes 2 log files and checks the Sqoop RC.
    ...
}
queue() {
    queue="$queue $1"
    num=$(($num+1))
}
regeneratequeue() {
    oldrequeue=$queue
    queue=""
    num=0
    for PID in $oldrequeue; do
        if [ -d /proc/"$PID" ]; then
            queue="$queue $PID"
            num=$(($num+1))
        fi
    done
}
checkqueue() {
    oldchqueue=$queue
    for PID in $oldchqueue; do
        if [ ! -d /proc/"$PID" ]; then
            regeneratequeue # at least one PID has finished
            break
        fi
    done
}
# Check for mandatory values.
...
#### HeavyLifting ####
# Since I have a file containing the list of tables to Sqoop along with other
# useful arguments like sourceDB, sourceTable, hiveDB, HiveTable, number of parallels,
# etc, all in the same line, I use awk to grab them and then
# I pass them to the function doSqoop().
# So, here I:
# 1. create a temp folder
# 2. grab values from line with awk
# 3. launch doSqoop() as below:
# 4. Monitor spawned jobs
awk '!/^($|#)/' < "$listOfTables" | { while read -r line; do
    # look for the folder or create it
    # .....
    # extract values from the line with awk
    # ....
    # launch doSqoop() with this line:
    (doSqoop) &
    PID=$!
    queue $PID
    while [[ "$num" -ge "$parallels" ]]; do
        checkqueue
        sleep 0.5
    done
done; }
# Here I tried to put wait, without success.
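As a side note on the dispatcher pattern itself: a bare `wait` with no arguments at the very end of the script does block until every child exits (the apparent "waits for ENTER" behavior is usually just late job output interleaving with the interactive shell prompt). A minimal, hedged sketch of the throttle-then-wait pattern — assuming bash >= 4.3 for `wait -n`, with hypothetical table names standing in for the real work:

```shell
#!/bin/bash
# Throttle background jobs to $parallels at a time, then wait for all of them.
parallels=2

run_jobs() {
    local t
    for t in tableA tableB tableC tableD; do   # hypothetical table names
        ( sleep 0.2; echo "done $t" ) &        # stand-in for (doSqoop) &
        # Throttle: wait -n (bash >= 4.3) returns as soon as ONE job exits,
        # instead of polling /proc every half second.
        while (( $(jobs -rp | wc -l) >= parallels )); do
            wait -n
        done
    done
    wait    # block here until every remaining child has exited
    echo "all jobs complete"
}

run_jobs
```

Here `wait -n` replaces the polling checkqueue/regeneratequeue loop: the shell sleeps until any one job finishes rather than re-scanning the PID list on a timer.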
OK, so I managed to implement what DeeBee suggested, which as far as I can tell is correct. I did not implement what Duffy said, because I don't understand it well enough and I don't have time ATM.
The problem now is that I moved some code into the doSqoop function, and it cannot create the /tmp folder needed for the logs.
I can't see what is wrong. Here is the code, followed by the error.
Please consider that the query parameter is very long and contains spaces.
#!/bin/bash
# Script to download lot of tables in parallel with Sqoop and write them to Hive
confFile=$1
listOfTables=$2
# Source configuration values
. "$confFile"
# TODO: delete sqoop tmp directory after jobs ends #
doSqoop() {
    local origSchema="$1"
    local origTable="$2"
    local hiveSchema="$3"
    local hiveTable="$4"
    local splitColumn="$5"
    local sqoopParallels="$6"
    local query="$7"
    # databaseBaseDir must be set before the log paths that use it
    databaseBaseDir="$baseDir"/"$origSchema"-"$hiveSchema"
    local logFileSummary="$databaseBaseDir"/"$hiveTable"-summary.log
    local logFileRaw="$databaseBaseDir"/"$hiveTable"-raw.log
    [ -d "$databaseBaseDir" ] || mkdir -p "$databaseBaseDir"
    if [[ $? -ne 0 ]]; then
        echo -e "Unable to complete the process.\nCannot create logs folder $databaseBaseDir"
        exit 1
    fi
    echo "#### [$(date +%Y-%m-%dT%T)] Creating Hive table $hiveSchema.$hiveTable from source table $origSchema.$origTable ####" | tee -a "$logFileSummary" "$logFileRaw"
    echo -e "\n\n"
    quote="'"
    sqoop import -Dmapred.job.queuename="$yarnQueue" -Dmapred.job.name="$jobName" \
        --connect "$origServer" \
        --username SQOOP --password-file file:///"$passwordFile" \
        --delete-target-dir \
        --target-dir "$targetTmpHdfsDir"/"$hiveTable" \
        --outdir "$dirJavaCode" \
        --hive-import \
        --hive-database "$hiveSchema" \
        --hive-table "$hiveTable" \
        --hive-partition-key "$hivePartitionName" --hive-partition-value "$hivePartitionValue" \
        --query "$quote $query where \$CONDITIONS $quote" \
        --null-string '' --null-non-string '' \
        --num-mappers 1 \
        --fetch-size 2000000 \
        --as-textfile \
        -z --compression-codec org.apache.hadoop.io.compress.SnappyCodec |& tee -a "$logFileRaw"
    sqoopRc=${PIPESTATUS[0]}   # plain $? here would be the exit status of tee
    if [[ $sqoopRc -ne 0 ]]; then
        echo "[$(date +%Y-%m-%dT%T)] Error importing $hiveSchema.$hiveTable !" | tee -a "$logFileSummary" "$logFileRaw"
        echo "$hiveSchema.$hiveTable" >> "$databaseBaseDir"/failed_imports.txt
    fi
    echo "Tail of : $logFileRaw" >> "$logFileSummary"
    tail -10 "$logFileRaw" >> "$logFileSummary"
}
export -f doSqoop
# Check for mandatory values.
if [[ ! -f "$confFile" ]]; then
    echo -e " $confFile does not appear to be a valid file.\n"
    usage
fi
if [[ ! -f "$listOfTables" ]]; then
    echo -e " $listOfTables does not appear to be a valid file.\n"
    usage
fi
if [[ -z "${username+x}" ]]; then
    echo -e " A valid username is required to access the Source.\n"
    usage
fi
if [[ ! -f "$passwordFile" ]]; then
    echo -e " Password File $passwordFile does not appear to be a valid file.\n"
    usage
fi
if [[ -z "${origServer+x}" ]]; then
    echo -e " Sqoop connection string is required.\n"
    usage
fi
#### HeavyLifting ####
awk -F"|" '!/^($|#)/ {print $1 $2 $3 $4 $5 $6 $7}' < "$listOfTables" | xargs -n7 -P$parallels bash -c "doSqoop {}"
mkdir: cannot create directory `/{}-'mkdir: : Permission deniedcannot create directory `/{}-'
mkdir: : Permission denied
cannot create directory `/{}-': Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
Unable to complete the process.
Cannot create logs folder /{}-
Unable to complete the process.
Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
mkdir: mkdir: cannot create directory `/{}-'cannot create directory `/{}-': Permission denied: Permission denied
Unable to complete the process.
Cannot create logs folder /{}-
Unable to complete the process.
Cannot create logs folder /{}-
Unable to complete the process.
Cannot create logs folder /{}-
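The `/{}-` paths in the errors above give the cause away: in `bash -c "doSqoop {}"` the `{}` is never replaced (no `-I` was given), so the function receives the literal string `{}`, and plain variables like `$baseDir` are not exported to the child bash either, leaving the path as `/{}-`. A common way to forward xargs-produced arguments into an exported function is `bash -c '… "$@"' _`. A sketch with a stand-in function (doWork is hypothetical, standing in for doSqoop):

```shell
#!/bin/bash
# Forward xargs-produced arguments into an exported shell function.
doWork() {
    echo "got: $1 | $2 | $3"     # stand-in for doSqoop
}
export -f doWork

# -n3 gives three args per call; "$@" inside the -c string forwards them;
# the trailing _ fills $0 of the inner bash so args start at $1.
printf '%s\n' a b c d e f | xargs -n3 bash -c 'doWork "$@"' _
# prints:
#   got: a | b | c
#   got: d | e | f
```

Any plain shell variables the function relies on must also be `export`ed, since each invocation runs in a fresh bash process.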
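A separate subtlety in the script above: the Sqoop exit status is captured after piping through `tee`, but after a pipeline `$?` is the exit status of the last stage (tee), so a failing sqoop would go unnoticed; bash's `${PIPESTATUS[0]}` preserves the first stage's status. A stand-alone illustration, with `false` as a stand-in for a failing sqoop invocation:

```shell
#!/bin/bash
# After a pipeline, $? is the status of the LAST stage (tee here),
# so a failure in the first command would be masked.
false | tee /dev/null       # stand-in for: sqoop ... |& tee -a "$logFileRaw"
rc=${PIPESTATUS[0]}         # read immediately; any later command resets it
echo "first stage exited with $rc"   # prints: first stage exited with 1
```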
Answer 0 (score: 0)
Since you push doSqoop into a background job with &, the only things limiting your script's execution time are sleep 0.5 and however long checkqueue takes to run.
Have you considered using xargs to run the function in parallel?
An example that I think approximates your use case:
$ cat sqoop.bash
#!/bin/bash
doSqoop(){
local arg="${1}"
sleep $(shuf -i 1-10 -n 1) # random between 1 and 10 seconds
echo -e "${arg}\t$(date +'%H:%M:%S')"
}
export -f doSqoop # so xargs can use it
threads=$(nproc) # number of cpu cores
awk '{print}' < tables.list | xargs -n1 -P${threads} -I {} bash -c "doSqoop {}"
$ seq 1 15 > tables.list
Result:
$ ./sqoop.bash
3 11:29:14
4 11:29:14
8 11:29:14
9 11:29:15
11 11:29:15
1 11:29:20
2 11:29:20
6 11:29:21
14 11:29:22
7 11:29:23
5 11:29:23
13 11:29:23
15 11:29:24
10 11:29:24
12 11:29:24
Sometimes it is nice to let xargs do the work for you.
Edit:
An example passing 3 args to the function, with a maximum of 8 parallel operations:
$ cat sqoop.bash
#!/bin/bash
doSqoop(){
a="${1}"; b="${2}"; c="${3}"
sleep $(shuf -i 1-10 -n 1) # do some work
echo -e "$(date +'%H:%M:%S') $a $b $c"
}
export -f doSqoop
awk '{print $1,$3,$5}' tables.list | xargs -n3 -P8 -I {} bash -c "doSqoop {}"
$ cat tables.list
1a 1b 1c 1d 1e
2a 2b 2c 2d 2e
3a 3b 3c 3d 3e
4a 4b 4c 4d 4e
5a 5b 5c 5d 5e
6a 6b 6c 6d 6e
7a 7b 7c 7d 7e
$ ./sqoop.bash
09:46:57 1a 1c 1e
09:46:57 7a 7c 7e
09:47:05 3a 3c 3e
09:47:06 4a 4c 4e
09:47:06 2a 2c 2e
09:47:09 5a 5c 5e
09:47:09 6a 6c 6e
Answer 1 (score: 0)
With GNU Parallel you can do it like this:
export -f doSqoop
grep -Ev '^#' "$listOfTables" |
parallel -r --colsep '\|' -P$parallels doSqoop {}
If you only want one process per CPU core:
... | parallel -r --colsep '\|' doSqoop {}
Answer 2 (score: 0)
After a while I now have the time to answer my own question, since I really don't want anyone else to get stuck on this kind of problem.
I ran into more than one issue, related both to bugs in my code and to my use of xargs. In hindsight, based on my experience, I would definitely suggest NOT using xargs for this kind of thing. Bash is not the best-suited language for it, but if you are forced to use it, consider GNU Parallel instead. I will move my script to it soon.
Regarding the issues: to feed whole lines to the function I used -l1 -I args. This way xargs treats each line as a single argument and passes it to the function (where I parse the fields with awk).
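A hedged sketch of the `-l1 -I args` approach described above (GNU xargs; `-I` already implies one line per invocation, and the field names and separator below are hypothetical): each whole line is substituted for `args` inside the bash -c string, and the function splits the fields itself.

```shell
#!/bin/bash
# One whole line per call: -I args substitutes the line verbatim into
# the command string; the function then splits the pipe-separated fields.
doLine() {
    IFS='|' read -r schema table hivedb <<< "$1"   # hypothetical field names
    echo "schema=$schema table=$table hivedb=$hivedb"
}
export -f doLine

printf '%s\n' 'db1|t1|h1' 'db2|t2|h2' |
    xargs -I args -P2 bash -c 'doLine "args"'
```

Note that substituting raw input into a bash -c string only works for trusted input: a line containing quotes or `$` would be interpreted by the inner shell, which is one more reason GNU Parallel's `--colsep` is the safer tool here.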