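I am trying to run hadoop fs -ls on a directory of the form /a/b/c/d/e/f/g/h/test/exp_dt=YYYY-MM-DD for a date range (e.g. 20170517 to 20170521). Is there a way to capture/know whether the directories for a given date range fall within a timestamp range that you supply? This would help distinguish whether a partition comes from an old run or a new run for the same date.
e.g.: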
startdate=20180517
enddate=20180521
timestamp1="2018-05-18 13:00"
timestamp2="2018-05-22 13:00"
inputPath=/a/b/c/d/e/f/g/h/test/
hlsCmd=`hadoop fs -ls $inputPath | awk '{timestamp = $6 ; hourMin = $7 ; path = $8 ; print timestamp; print hourMin; print path; print ","}'`
echo $hlsCmd
ingestFlag=1
startdate=20180517
enddate=20180521
date="$enddate"
dates=()
for (( date="$enddate" , cnt=1, missCnt=0, foundCnt=0 ; $date >= $startdate ; date="$(date --date="$date - 1 days" +'%Y%m%d')" , cnt++));
do
dates+=( "$date" )
if [ $ingestFlag == 1 ]; then
curDate="$(date --date="$date" +'%Y-%m-%d')"
else
curDate="$(date --date="$date" +'%Y/%m/%d')"
fi;
curDateYYYYMMDD="$(date --date="$date" +'%Y%m%d')"
fmeYYYYMM="$(date --date="$date + 1 month" +'%Y%m')"
if echo "$hlsCmd" | grep -q "$curDate" ; then
((foundCnt++))
echo "$inputPath : $curDate found"
# echo "$inputPath : $curDate found" >> $foundFileName;
else
((missCnt++))
echo "$inputPath : $curDate missing $curDateYYYYMMDD"
echo "$inputPath : $curDate missing $curDateYYYYMMDD" >> $missingFileName;
fi;
done
output:
/a/b/c/d/e/f/g/h/test/ : 2018-05-21 found
/a/b/c/d/e/f/g/h/test/ : 2018-05-20 found
/a/b/c/d/e/f/g/h/test/ : 2018-05-19 found
/a/b/c/d/e/f/g/h/test/ : 2018-05-18 found
/a/b/c/d/e/f/g/h/test/ : 2018-05-17 found
Sample output of $hlsCmd:
,
2018-06-06 10:33 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-03 ,
2018-06-07 12:30 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-04 ,
2018-06-08 10:48 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-05 ,
2018-06-08 14:38 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-06 ,
2018-06-09 10:23 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-07 ,
2018-06-10 11:13 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-08 ,
2018-06-11 10:43 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-09 ,
2018-06-12 11:16 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-10
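In a listing like this sample, a date such as 2018-06-06 appears both as a modification date and inside an exp_dt= path, so a plain grep for the date cannot tell the two apart. A minimal sketch of one possible way around this is to compare only the last component of the path column. The function name partition_exists and the permission/owner columns in the sample listing are assumptions made for illustration; the dates and paths are taken from the sample output.

```shell
#!/usr/bin/env bash
# Hypothetical sample laid out like `hadoop fs -ls` output (8 columns per entry).
# The permission, owner, group and size columns are placeholders.
listing='Found 2 items
drwxr-xr-x   - user group          0 2018-06-06 10:33 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-03
drwxr-xr-x   - user group          0 2018-06-07 12:30 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-04'

# partition_exists LISTING DATE (hypothetical helper)
# Succeeds only if some path in column 8 ends in the component exp_dt=DATE.
# The modification date in column 6 is never consulted.
partition_exists() {
  printf '%s\n' "$1" | awk -v d="$2" '
    NF >= 8 {
      n = split($8, parts, "/")            # split the path on "/"
      if (parts[n] == "exp_dt=" d) found = 1
    }
    END { exit !found }'                   # exit status 0 only when a match was seen
}

partition_exists "$listing" 2018-06-03 && echo "2018-06-03 found"
# 2018-06-06 is only a modification date here, not a partition:
partition_exists "$listing" 2018-06-06 || echo "2018-06-06 missing"
```

With the sample above, the grep-based check would report 2018-06-06 as found, whereas this comparison reports it as missing.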
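Blocker: The problem is that the pattern match in the code above can also hit the directory's timestamp (YYYY-MM-DD) in the awk output and report a positive result. The goal is to check whether the directories for a date range fall within a given timestamp range. Please let me know what can be done.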
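One possible approach, sketched here under stated assumptions rather than as a definitive answer: take the partition date from the path, take the modification timestamp from columns 6 and 7, and compare both as plain strings, since "YYYY-MM-DD" and "YYYY-MM-DD HH:MM" sort chronologically as text. The function name classify_partitions, the "in-window"/"outside-window" labels, and the permission/owner columns of the sample listing are illustrative assumptions. The start and end dates are written as YYYY-MM-DD here so that they compare directly with the exp_dt values.

```shell
#!/usr/bin/env bash
# Assumed inputs for illustration.
startdate=2018-06-04
enddate=2018-06-06
timestamp1="2018-06-07 00:00"
timestamp2="2018-06-08 12:00"

# Hypothetical sample laid out like `hadoop fs -ls` output.
# The permission, owner, group and size columns are placeholders.
listing='Found 4 items
drwxr-xr-x   - user group          0 2018-06-06 10:33 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-03
drwxr-xr-x   - user group          0 2018-06-07 12:30 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-04
drwxr-xr-x   - user group          0 2018-06-08 10:48 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-05
drwxr-xr-x   - user group          0 2018-06-08 14:38 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-06'

# classify_partitions START END TS1 TS2 (hypothetical helper)
# Reads ls-style lines on stdin. For each exp_dt partition between START and END
# it prints the partition date, its modification timestamp, and whether that
# timestamp lies between TS1 and TS2 (inclusive).
classify_partitions() {
  awk -v s="$1" -v e="$2" -v t1="$3" -v t2="$4" '
    NF >= 8 {
      n = split($8, parts, "exp_dt=")
      if (n != 2) next                     # not an exp_dt=... path
      part = parts[2]
      if (part < s || part > e) next       # partition outside the date range
      modified = $6 " " $7
      status = (modified >= t1 && modified <= t2) ? "in-window" : "outside-window"
      print part, modified, status
    }'
}

printf '%s\n' "$listing" |
  classify_partitions "$startdate" "$enddate" "$timestamp1" "$timestamp2"
# Prints:
# 2018-06-04 2018-06-07 12:30 in-window
# 2018-06-05 2018-06-08 10:48 in-window
# 2018-06-06 2018-06-08 14:38 outside-window
```

In a real run the sample listing would be replaced by the output of hadoop fs -ls "$inputPath" piped into the function. A partition flagged "outside-window" would then be a candidate for an old run, assuming the directory's modification time reflects when the run wrote it.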