I have an HDFS file with the following sample data:
id name timestamp
1 Lorem 2013-01-01
2 Ipsum 2013-02-01
3 Ipsum 2013-03-01
Now I want to split the data into multiple directories of the form /data/YYYY/MM/DD; for example, record 1 should go to the directory /data/2013/01/01.
Pig has a MultiStorage UDF that can split into a single directory level by year, or month, or day. Is there any way to split into multiple (nested) directory levels?
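For reference, a minimal sketch of the single-level MultiStorage split, assuming piggybank.jar at the usual install path; the input/output paths and field index are illustrative only:

#!/bin/bash
# Hypothetical illustration of the single-level MultiStorage split.
# Assumes piggybank.jar ships with the Pig install; adjust the path as needed.
pig <<'EOF'
REGISTER /usr/lib/pig/piggybank.jar;
-- load the space-delimited sample file
A = LOAD '/data/input' USING PigStorage(' ') AS (id:int, name:chararray, ts:chararray);
-- project a year column to split on (field index 3 below)
B = FOREACH A GENERATE id, name, ts, SUBSTRING(ts, 0, 4) AS year;
-- MultiStorage splits on ONE field only, hence the question above
STORE B INTO '/data/out' USING org.apache.pig.piggybank.storage.MultiStorage('/data/out', '3');
EOF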
Answer 0 (score: 2)
You can choose from the following three approaches:

- Hive-style partition naming, with the partition column as a column=value prefix in the directory names: /data/year=2016/month=01/date=07

Let me know which approach you prefer and I will update the answer based on that.
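A minimal sketch of that Hive-style layout, with the date values hardcoded purely for illustration:

#!/bin/bash
# Hypothetical values; in practice these come from each record's timestamp.
YEAR=2016; MONTH=01; DAY=07
# Same idea as the script below, but with column=value directory names
hadoop fs -mkdir -p "/data/year=${YEAR}/month=${MONTH}/date=${DAY}"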
Update with a shell script solution:
Given two input/source files with the same content in HDFS:
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/test_dir
Found 2 items
-rw-r--r-- 1 cloudera cloudera 79 2016-08-02 04:43 /user/cloudera/test_dir/test.file1
-rw-r--r-- 1 cloudera cloudera 79 2016-08-02 04:43 /user/cloudera/test_dir/test.file2
Shell script:
#!/bin/bash
# Assuming src files are in HDFS; for local src file
# processing, change the path and commands accordingly.
# If you do NOT want a header written into each target file,
# comment out the header-writing part of the script below.
src_file_path='/user/cloudera/test_dir'
trg_file_path='/user/cloudera/trgt_dir'

# list the source files, dropping the "Found N items" summary line
src_files=`hadoop fs -ls ${src_file_path} | awk -F " " '{print $NF}' | grep -v items`

for src_file in $src_files
do
    echo "processing ${src_file} file..."
    while IFS= read -r line
    do
        # ignore the header from processing - it contains *id*
        if [[ $line != *"id"* ]]; then
            DATE=`echo $line | awk -F " " '{print $NF}'`
            YEAR=`echo $DATE | awk -F "-" '{print $1}'`
            MONTH=`echo $DATE | awk -F "-" '{print $2}'`
            DAY=`echo $DATE | awk -F "-" '{print $3}'`
            file_name="file_${DATE}"

            # create the target directory if it does not exist yet
            hadoop fs -test -d ${trg_file_path}/$YEAR/$MONTH/$DAY
            if [ $? != 0 ]; then
                echo "dir not exist creating... ${trg_file_path}/$YEAR/$MONTH/$DAY"
                hadoop fs -mkdir -p ${trg_file_path}/$YEAR/$MONTH/$DAY
            fi

            # write the header once, when the target file is first created
            hadoop fs -test -f ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name
            if [ $? != 0 ]; then
                echo "file not exist: creating header... ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name"
                echo "id name timestamp" | hadoop fs -appendToFile - ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name
            fi

            echo "writing line: '$line' to file: ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name"
            echo $line | hadoop fs -appendToFile - ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name
        fi
    done < <(hadoop fs -cat $src_file)
done
Save the script as manageFiles.sh and run it:
[cloudera@quickstart ~]$ ./manageFiles.sh
processing /user/cloudera/test_dir/test.file1 file...
dir not exist creating... /user/cloudera/trgt_dir/2013/01/01
file not exist: creating header... /user/cloudera/trgt_dir/2013/01/01/file_2013-01-01
writing line: '1 Lorem 2013-01-01' to file: /user/cloudera/trgt_dir/2013/01/01/file_2013-01-01
dir not exist creating... /user/cloudera/trgt_dir/2013/02/01
file not exist: creating header... /user/cloudera/trgt_dir/2013/02/01/file_2013-02-01
writing line: '2 Ipsum 2013-02-01' to file: /user/cloudera/trgt_dir/2013/02/01/file_2013-02-01
dir not exist creating... /user/cloudera/trgt_dir/2013/03/01
file not exist: creating header... /user/cloudera/trgt_dir/2013/03/01/file_2013-03-01
writing line: '3 Ipsum 2013-03-01' to file: /user/cloudera/trgt_dir/2013/03/01/file_2013-03-01
processing /user/cloudera/test_dir/test.file2 file...
writing line: '1 Lorem 2013-01-01' to file: /user/cloudera/trgt_dir/2013/01/01/file_2013-01-01
writing line: '2 Ipsum 2013-02-01' to file: /user/cloudera/trgt_dir/2013/02/01/file_2013-02-01
writing line: '3 Ipsum 2013-03-01' to file: /user/cloudera/trgt_dir/2013/03/01/file_2013-03-01
[cloudera@quickstart ~]$ hadoop fs -cat /user/cloudera/trgt_dir/2013/03/01/file_2013-03-01
id name timestamp
3 Ipsum 2013-03-01
3 Ipsum 2013-03-01
[cloudera@quickstart ~]$
Answer 1 (score: 0)
You can create a Hive table partitioned on the timestamp column and simply use HCatStorer in Pig to store the data into it.
That way you may not get exactly the directory names you chose, but the data ends up split across multiple directories, as required.
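A rough sketch of that approach, assuming a hypothetical table default.events partitioned by year/month/day and the newer org.apache.hive.hcatalog package name; adjust names and paths to your setup:

#!/bin/bash
# Hypothetical table and paths - adjust to your environment.
# 1) Create a partitioned Hive table.
hive -e "
CREATE TABLE IF NOT EXISTS default.events (id INT, name STRING, ts STRING)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS TEXTFILE;"

# 2) Load, derive the partition columns, and store via HCatStorer,
#    which partitions dynamically on the trailing year/month/day columns.
pig -useHCatalog <<'EOF'
A = LOAD '/user/cloudera/test_dir' USING PigStorage(' ') AS (id:int, name:chararray, ts:chararray);
B = FILTER A BY id IS NOT NULL;  -- drops the header row, whose id does not parse as int
C = FOREACH B GENERATE id, name, ts,
        SUBSTRING(ts, 0, 4) AS year, SUBSTRING(ts, 5, 7) AS month, SUBSTRING(ts, 8, 10) AS day;
STORE C INTO 'default.events' USING org.apache.hive.hcatalog.pig.HCatStorer();
EOF

The resulting layout would be something like .../events/year=2013/month=01/day=01/ under the Hive warehouse directory rather than /data/2013/01/01, which is the trade-off mentioned above.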