How to split HDFS data into multiple directories

Date: 2016-08-01 05:42:48

Tags: hadoop apache-pig hdfs

I have an HDFS file containing the following sample data:

id name timestamp
1 Lorem 2013-01-01
2 Ipsum 2013-02-01
3 Ipsum 2013-03-01

Now I want to split this data into multiple directories of the form /data/YYYY/MM/DD, so that, for example, record 1 goes to the directory /data/2013/01/01.
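For illustration, the intended record-to-directory mapping can be sketched in a few lines of plain bash (the target_dir helper is my own name, not part of the original question):

```shell
#!/bin/bash
# Given a record like "1 Lorem 2013-01-01", derive the /data/YYYY/MM/DD
# target directory from its last (timestamp) field using parameter expansion.
target_dir() {
    local line="$1"
    local date=${line##* }      # last space-separated field, e.g. 2013-01-01
    local year=${date%%-*}      # 2013
    local rest=${date#*-}       # 01-01
    local month=${rest%%-*}     # 01
    local day=${rest#*-}        # 01
    echo "/data/${year}/${month}/${day}"
}

target_dir "1 Lorem 2013-01-01"     # prints /data/2013/01/01
```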

Pig has the MultiStorage UDF, which can split output into single-level directories by year, or month, or day alone. Is there any way to split into nested multi-level directories?

2 Answers:

Answer 0 (score: 2):

You can choose from the following approaches:

  1. Write a shell script to do this task.
  2. Use a custom Partitioner class.
  3. Write a MapReduce job.
  4. Create a Hive partitioned table and partition it by year, month, and day; note that each directory name will then be prefixed with the partition column name (year=, month=, date=), e.g. /data/year=2016/month=01/date=07.

  Let me know which approach you prefer, and I will update the answer based on that.

    Update with the shell script solution:

    Given two input/source files in HDFS with identical content:

    [cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/test_dir
    Found 2 items
    -rw-r--r--   1 cloudera cloudera         79 2016-08-02 04:43 /user/cloudera/test_dir/test.file1
    -rw-r--r--   1 cloudera cloudera         79 2016-08-02 04:43 /user/cloudera/test_dir/test.file2
    

    Shell script:

    #!/bin/bash
    # Assuming src files are in hdfs, for local src file 
    # processing change the path and command accordingly
    # if you do NOT want to write header in each target file
    # then you can comment the writing header part from below script
    
    src_file_path='/user/cloudera/test_dir'
    trg_file_path='/user/cloudera/trgt_dir'
    
    src_files=`hadoop fs -ls ${src_file_path}|awk -F " " '{print $NF}'|grep -v items`
    
    for src_file in $src_files
    do
        echo processing ${src_file} file...
    
        while IFS= read -r line 
        do
           #ignore header from processing - that contains *id*
           if [[ $line != *"id"* ]];then
    
            DATE=`echo $line|awk -F " " '{print $NF}'`
            YEAR=`echo $DATE|awk -F "-" '{print $1}'`
            MONTH=`echo $DATE|awk -F "-" '{print $2}'`
            DAY=`echo $DATE|awk -F "-" '{print $3}'`
                    file_name="file_${DATE}"
    
            hadoop fs -test -d ${trg_file_path}/$YEAR/$MONTH/$DAY
    
            if [ $? != 0 ];then
                echo "dir not exist creating... ${trg_file_path}/$YEAR/$MONTH/$DAY "
                hadoop fs -mkdir -p ${trg_file_path}/$YEAR/$MONTH/$DAY
            fi
    
    
            hadoop fs -test -f ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name
    
                    if [ $? != 0 ];then
                         echo "file not exist: creating header... ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name"
                         echo "id name timestamp" |hadoop fs -appendToFile - ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name
                    fi
    
            echo "writing line: \'$line\' to file: ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name"
            echo $line |hadoop fs -appendToFile - ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name
           fi
        done < <(hadoop fs -cat $src_file)
    done
    

    Running the manageFiles.sh script:

    [cloudera@quickstart ~]$ ./manageFiles.sh
    processing /user/cloudera/test_dir/test.file1 file...
    dir not exist creating... /user/cloudera/trgt_dir/2013/01/01 
    file not exist: creating header... /user/cloudera/trgt_dir/2013/01/01/file_2013-01-01
    writing line: '1 Lorem 2013-01-01'  to file: /user/cloudera/trgt_dir/2013/01/01/file_2013-01-01
    dir not exist creating... /user/cloudera/trgt_dir/2013/02/01 
    file not exist: creating header... /user/cloudera/trgt_dir/2013/02/01/file_2013-02-01
    writing line: '2 Ipsum 2013-02-01'  to file: /user/cloudera/trgt_dir/2013/02/01/file_2013-02-01
    dir not exist creating... /user/cloudera/trgt_dir/2013/03/01 
    file not exist: creating header... /user/cloudera/trgt_dir/2013/03/01/file_2013-03-01
    writing line: '3 Ipsum 2013-03-01'  to file: /user/cloudera/trgt_dir/2013/03/01/file_2013-03-01
    processing /user/cloudera/test_dir/test.file2 file...
    writing line: '1 Lorem 2013-01-01'  to file: /user/cloudera/trgt_dir/2013/01/01/file_2013-01-01
    writing line: '2 Ipsum 2013-02-01'  to file: /user/cloudera/trgt_dir/2013/02/01/file_2013-02-01
    writing line: '3 Ipsum 2013-03-01'  to file: /user/cloudera/trgt_dir/2013/03/01/file_2013-03-01
    
    [cloudera@quickstart ~]$ hadoop fs -cat /user/cloudera/trgt_dir/2013/03/01/file_2013-03-01
    id name timestamp
    3 Ipsum 2013-03-01
    3 Ipsum 2013-03-01
    [cloudera@quickstart ~]$ 
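To try the same splitting logic without a Hadoop cluster, here is a sketch that runs against the local filesystem, with plain mkdir/cat in place of the hadoop fs commands (the file and directory names below are my own choices):

```shell
#!/bin/bash
# Local-filesystem sketch of the splitting logic above:
# read whitespace-delimited records, skip the header line,
# and append each record to <trg>/<YYYY>/<MM>/<DD>/file_<date>.
src_file="src.txt"
trg="./trgt_dir"

printf 'id name timestamp\n1 Lorem 2013-01-01\n2 Ipsum 2013-02-01\n' > "$src_file"

while IFS= read -r line; do
    # ignore the header line
    [[ $line == id* ]] && continue

    date=${line##* }                          # last field, e.g. 2013-01-01
    IFS=- read -r year month day <<< "$date"  # split into YYYY MM DD
    dir="$trg/$year/$month/$day"

    mkdir -p "$dir"
    # write the header once, only when the target file is new
    [ -f "$dir/file_$date" ] || echo "id name timestamp" > "$dir/file_$date"
    echo "$line" >> "$dir/file_$date"
done < "$src_file"
```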
    

Answer 1 (score: 0):

You can create a Hive partitioned table based on the timestamp column and use  HCatStorer  in Pig to store the data into it.

This way you may not get exactly the directory layout you asked for, but the data will end up split across multiple directories as required.
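As a sketch of that approach (table, column, and file names are illustrative, and HCatalog must be configured), the Hive table is partitioned on date parts derived from the timestamp, and the Pig script hands rows to HCatStorer for dynamic partitioning. The snippet below only generates the two scripts; running them requires a cluster:

```shell
#!/bin/bash
# Sketch only: write out the Hive DDL and the Pig script for the
# HCatStorer approach. Names are illustrative, not from the original post.
cat > create_table.hql <<'EOF'
-- Partition directories become year=YYYY/month=MM/day=DD under the table location
CREATE TABLE sample_data (id INT, name STRING, ts STRING)
PARTITIONED BY (year STRING, month STRING, day STRING);
EOF

cat > load_data.pig <<'EOF'
-- Derive the partition columns from the timestamp, then let
-- HCatStorer create one partition per (year, month, day)
A = LOAD '/user/cloudera/test_dir' USING PigStorage(' ')
        AS (id:int, name:chararray, ts:chararray);
B = FOREACH A GENERATE id, name, ts,
        SUBSTRING(ts, 0, 4)  AS year,
        SUBSTRING(ts, 5, 7)  AS month,
        SUBSTRING(ts, 8, 10) AS day;
STORE B INTO 'sample_data' USING org.apache.hive.hcatalog.pig.HCatStorer();
EOF
```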