Question

output_path=s3://output
unziped_dir=s3://2019-01-03
files=`hadoop fs -ls $output_path/ | awk '{print $NF}' | grep .gz$ | tr '\n' ' '`;
for f in $files
do   
echo "available files are: $f"
filename=$(hadoop fs -ls $f | awk -F '/' '{print $NF}' | head -1)
hdfs dfs -cat $f | gzip -d | hdfs dfs -put - $unziped_dir"/"${filename%.*}
echo "unziped file names: ${filename%.*}"
done

Output:

Dev:

available files are: s3://2019-01-03/File_2019-01-03.CSV.gz
unziped file names: File_2019-01-03.CSV
available files are: s3://2019-01-03/Data_2019-01-03.CSV.gz
unziped file names: Data_2019-01-03.CSV
available files are: s3://2019-01-03/Output_2019-01-03.CSV.gz
unziped file names: Output_2019-01-03.CSV

Prod:

available files are: s3://2019-01-03/File_2019-01-03.CSV.gz s3://2019-01-03/Data_2019-01-03.CSV.gz s3://2019-01-03/Output_2019-01-03.CSV.gz 
unziped file names:

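I am trying to look into the directory, identify the .gz files, and iterate over them to unzip each one and store it in a different directory. When I run this script on the EMR Dev cluster it works fine, but on the Prod cluster it does not. Please see the behavior of the script above.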

Answer 1

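There seems to be a problem with word splitting in for f in $files. Normally the shell should split the value of $files at spaces, as it does on Dev. On Dev, f is set to one of the three words of $files in each iteration of the for loop; on Prod, f receives the complete value of $files, including the spaces.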

Did you set the variable IFS somewhere?
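
To illustrate why IFS is the first suspect, here is a minimal sketch that reproduces both behaviors in one shell. The file names and the helper count_words are hypothetical; the newline-only IFS is just one example of a non-default value that would stop splitting at spaces.

```shell
#!/bin/sh
# Sample value standing in for the space-separated listing (hypothetical names).
files="a.gz b.gz c.gz"

# count_words: print how many words an unquoted expansion of $files
# produces under the IFS value currently in effect.
count_words() {
    n=0
    for f in $files; do
        n=$((n + 1))
    done
    echo "$n"
}

# Default IFS (space, tab, newline): the value is split at the spaces.
default_count=$(count_words)

# IFS restricted to newline only: spaces no longer separate words, so the
# loop body runs once with the whole string, as in the Prod output.
old_ifs=$IFS
IFS='
'
newline_only_count=$(count_words)
IFS=$old_ifs

echo "default IFS: $default_count iteration(s)"
echo "newline-only IFS: $newline_only_count iteration(s)"
```

If the Prod environment exports or sets a similar IFS value before the script runs, the single-iteration output from the question is what you would expect.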

If the problem is not caused by other parts of your script, you should be able to reproduce it with a reduced script:

files="foo bar baz"
for f in $files
do   
  echo "available files are: $f"
done

If this minimal script behaves the same on both clusters, the problem is in another part of your script.

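To check whether the value of IFS differs between Dev and Prod, you can add this to the minimal script, or to the original script just before the for loop: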

# To see if IFS is different. With the default value (space, tab, newline) the output should be
# 0000000   I   F   S   =   #      \t  \n   #  \n
# 0000012
echo "IFS=#${IFS}#" | od -c

If you see that the value of IFS differs, you have to find out where IFS is modified.
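
While you track down where IFS is changed, one possible stopgap (an assumption on my part, not a fix for the root cause) is to unset IFS just before the loop. When IFS is unset, the shell splits fields as if it held the default space, tab, and newline. The sample value below is hypothetical.

```shell
#!/bin/sh
files="a.gz b.gz c.gz"   # hypothetical stand-in for the real listing

# Simulate an environment that changed IFS to newline only.
IFS='
'

# With IFS unset, field splitting behaves as if IFS were space, tab, newline.
unset IFS

n=0
for f in $files; do
    n=$((n + 1))
done
echo "iterations after unset IFS: $n"
```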

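BTW: Normally you can omit the | tr '\n' ' ' after the grep command. The shell should accept \n as a word-splitting character when it processes for f in $files. If it does not, this is probably related to the source of your problem.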
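
A small sketch of that point, assuming the default IFS and a simulated listing with hypothetical paths: the loop runs the same number of times with and without the tr step.

```shell
#!/bin/sh
# Simulated listing, one path per line (hypothetical names).
listing=$(printf '%s\n' s3://bucket/a.gz s3://bucket/b.gz s3://bucket/c.gz)

# Newlines converted to spaces first, as in the original script.
with_tr=0
for f in $(printf '%s\n' "$listing" | tr '\n' ' '); do
    with_tr=$((with_tr + 1))
done

# Newlines left in place: with the default IFS they split words as well.
without_tr=0
for f in $listing; do
    without_tr=$((without_tr + 1))
done

echo "with tr: $with_tr, without tr: $without_tr"
```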

Edit: There are better solutions for processing data line by line; see
https://mywiki.wooledge.org/DontReadLinesWithFor and
https://mywiki.wooledge.org/BashFAQ/001

You should use while read ... instead of for ...

Modified script (untested)

output_path=s3://output
unziped_dir=s3://2019-01-03

# Quote the pattern and escape the dot so only names ending in ".gz" match.
hadoop fs -ls "$output_path"/ | awk '{print $NF}' | grep '\.gz$' | while IFS= read -r f
do   
    echo "available files are: $f"
    filename=$(hadoop fs -ls "$f" | awk -F '/' '{print $NF}' | head -1)
    hdfs dfs -cat "$f" | gzip -d | hdfs dfs -put - "${unziped_dir}/${filename%.*}"
    echo "unziped file names: ${filename%.*}"
done
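
The loop structure above can be tried without a cluster. In the sketch below, printf stands in for the hadoop listing, the paths are hypothetical, and the helper strip_gz_names is my own. It also derives the file name with the parameter expansion ${f##*/} instead of a second hadoop fs -ls call, which is a substitution on my part and not what the script above does.

```shell
#!/bin/sh
# strip_gz_names: read one path per line on stdin and print the target
# file name for each (base name with the final extension removed).
strip_gz_names() {
    while IFS= read -r f; do
        filename=${f##*/}        # drop everything up to the last '/'
        echo "${filename%.*}"    # drop the final extension (.gz)
    done
}

# printf stands in for: hadoop fs -ls "$output_path"/ | awk '{print $NF}'
result=$(printf '%s\n' \
    's3://output/File_2019-01-03.CSV.gz' \
    's3://output/notes.txt' \
    's3://output/Data 2019-01-03.CSV.gz' |
    grep '\.gz$' | strip_gz_names)

echo "$result"
```

Because read takes a whole line at a time with IFS emptied for that command, the name containing a space stays intact, and the loop does not depend on the IFS value inherited from the environment. Note that in most shells the while loop at the end of a pipeline runs in a subshell, so variables assigned inside it are not visible after done.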
