PIG script: expand a single row with a start date and an end date into multiple rows, one per day

Date: 2015-01-22 18:45:12

Tags: apache-pig

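I need a PIG script that converts a single row containing a campaign ID, start date, end date and amount into multiple rows: one row per day, carrying the amount allocated to that day. For example, the schema is: campaignId,startDate,endDate,totalAmount

My input row is: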

1,2015-01-01,2015-01-10,10000

I need to create a separate row for each day of this "campaign", dividing totalAmount evenly across the days, in a schema like this:

campaignId,date,amount

1,2015-01-01,1000
1,2015-01-02,1000
1,2015-01-03,1000

... and so on, one row for each day of the campaign.

I was hoping I could use a nested foreach and the DaysBetween function.

1 answer:

Answer 0: (score: 1)

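Solving this with standard Pig is somewhat difficult; the challenge is generating the dates between the two endpoints dynamically. Suppose the range spans two months (e.g., 2015-01-28 to 2015-02-06): Pig has no built-in intelligence to work out that it must produce 4 days from January and 6 days from February.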

To solve this, one option is to move the date generation into a custom UDF that parses the input and generates the intermediate dates.
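As a rough illustration of what such a UDF has to do, the sketch below expands a date range in plain Java. It uses java.time rather than Joda-Time, and the class and method names (DateRangeSketch, expand) are made up for this example. It is not the answer's UDF, only a minimal sketch of the date arithmetic.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the PARSEDATE UDF.
final class DateRangeSketch {

    // Returns every date from start to end, both inclusive, as yyyy-MM-dd strings.
    static List<String> expand(String start, String end) {
        LocalDate first = LocalDate.parse(start);
        LocalDate last = LocalDate.parse(end);
        // Inclusive day count, the same "+1" that the Pig script adds to DaysBetween.
        long days = ChronoUnit.DAYS.between(first, last) + 1;
        List<String> dates = new ArrayList<>();
        for (long i = 0; i < days; i++) {
            // plusDays rolls over month (and year) boundaries on its own.
            dates.add(first.plusDays(i).toString());
        }
        return dates;
    }

    public static void main(String[] args) {
        // A range that crosses from January into February.
        expand("2015-01-28", "2015-02-06").forEach(System.out::println);
    }
}
```

For the range 2015-01-28 to 2015-02-06 this yields ten dates, four in January and six in February, because plusDays takes care of the month boundary.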

Example 1: one input, with dates that do not span two months

Input:

1,2015-01-01,2015-01-10,10000

PigScript:

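-- Register the jar that contains the mypackage.PARSEDATE UDF
REGISTER PARSEDATE.jar;
A = LOAD 'input' Using PigStorage(',') AS (campaignId:int,startDate,endDate,totalAmount:int);
-- cnt is the number of days in the campaign (inclusive); mybag holds all fields of the row
B = FOREACH A GENERATE campaignId,DaysBetween((datetime)endDate,(datetime)startDate)+1 AS cnt, totalAmount,TOBAG(*) AS mybag;
-- PARSEDATE returns the dates joined by '#'; TOKENIZE and FLATTEN turn them into one row per day
C = FOREACH B GENERATE campaignId,FLATTEN(TOKENIZE(mypackage.PARSEDATE(BagToString(mybag)),'#')),(int)(totalAmount/cnt) AS totalAmount;
STORE C INTO 'output' USING PigStorage(',');

Output: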

1,2015-01-01,1000
1,2015-01-02,1000
1,2015-01-03,1000
1,2015-01-04,1000
1,2015-01-05,1000
1,2015-01-06,1000
1,2015-01-07,1000
1,2015-01-08,1000
1,2015-01-09,1000
1,2015-01-10,1000
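
To make the data flow of the script easier to follow, here is a plain-Java imitation of its steps. It assumes that BagToString joins the fields with its default '_' delimiter and that TOKENIZE splits on '#' much like java.util.StringTokenizer. All names (PipelineSketch, joinFields, parseDate, expandRow) are hypothetical stand-ins, and java.time replaces Joda-Time; treat it as a sketch of the idea rather than a description of the Pig runtime.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical walk-through class; it imitates the Pig steps in plain Java.
final class PipelineSketch {

    // Stand-in for BagToString(TOBAG(*)): joins the row's fields with "_".
    static String joinFields(String... fields) {
        return String.join("_", fields);
    }

    // Stand-in for mypackage.PARSEDATE: reads fields 1 and 2 of the joined
    // string and returns the dates in the form "#date#date...".
    static String parseDate(String joined) {
        String[] parts = joined.split("_");
        LocalDate start = LocalDate.parse(parts[1]);
        LocalDate end = LocalDate.parse(parts[2]);
        long days = ChronoUnit.DAYS.between(start, end) + 1;
        StringBuilder out = new StringBuilder();
        for (long i = 0; i < days; i++) {
            out.append('#').append(start.plusDays(i));
        }
        return out.toString();
    }

    // Stand-in for FLATTEN(TOKENIZE(..., '#')) plus the per-day amount column.
    static List<String> expandRow(String id, String start, String end, int total) {
        String joined = joinFields(id, start, end, String.valueOf(total));
        StringTokenizer tokens = new StringTokenizer(parseDate(joined), "#");
        List<String> rows = new ArrayList<>();
        int count = tokens.countTokens();
        if (count == 0) {
            return rows; // end date before start date: nothing to emit
        }
        int perDay = total / count; // integer division, like totalAmount/cnt
        while (tokens.hasMoreTokens()) {
            rows.add(id + "," + tokens.nextToken() + "," + perDay);
        }
        return rows;
    }

    public static void main(String[] args) {
        expandRow("1", "2015-01-01", "2015-01-10", 10000).forEach(System.out::println);
    }
}
```

The leading '#' in the generated string does no harm in this sketch, because StringTokenizer skips empty tokens.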

Example 2: two inputs; the first does not span two months, the second does

INPUT1:

1,2015-01-01,2015-01-10,10000
2,2015-01-28,2015-02-06,10000

PigScript:

REGISTER PARSEDATE.jar;
A = LOAD 'input1' Using PigStorage(',') AS (campaignId:int,startDate,endDate,totalAmount:int);
B = FOREACH A GENERATE campaignId,DaysBetween((datetime)endDate,(datetime)startDate)+1 AS cnt, totalAmount,TOBAG(*) AS mybag;
C = FOREACH B GENERATE campaignId,FLATTEN(TOKENIZE(mypackage.PARSEDATE(BagToString(mybag)),'#')),(int)(totalAmount/cnt) AS totalAmount;
STORE C INTO 'output1' USING PigStorage(',');

Output:

1,2015-01-01,1000
1,2015-01-02,1000
1,2015-01-03,1000
1,2015-01-04,1000
1,2015-01-05,1000
1,2015-01-06,1000
1,2015-01-07,1000
1,2015-01-08,1000
1,2015-01-09,1000
1,2015-01-10,1000
2,2015-01-28,1000
2,2015-01-29,1000
2,2015-01-30,1000
2,2015-01-31,1000
2,2015-02-01,1000
2,2015-02-02,1000
2,2015-02-03,1000
2,2015-02-04,1000
2,2015-02-05,1000
2,2015-02-06,1000

You need to compile the Java code below, build the PARSEDATE.jar file, and include it in your Pig script. I wrote this code quickly, so you can optimize it as needed.

PARSEDATE.java

package mypackage;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.joda.time.Days;
import org.joda.time.LocalDate;

public class PARSEDATE extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
                // Return null for missing input instead of failing the whole job
                if (input == null || input.size() == 0 || input.get(0) == null) {
                        return null;
                }

                // The input is the '_'-joined row, e.g. 1_2015-01-01_2015-01-10_10000
                String[] fields = ((String) input.get(0)).split("_");

                // Start date is the second column, end date is the third
                LocalDate st = new LocalDate(fields[1]);
                LocalDate et = new LocalDate(fields[2]);

                // Number of days in the range, counting both endpoints
                int days = Days.daysBetween(st, et).getDays() + 1;

                // Append every date, each prefixed by '#', so that the result
                // is easy to split in the Pig script
                StringBuilder output = new StringBuilder();
                for (int index = 0; index < days; index++) {
                        output.append('#').append(st.plusDays(index).toString());
                }
                return output.toString();
        }
}
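
One thing to watch if you adapt this: the script's (int)(totalAmount/cnt) is an integer division, so when the total does not divide evenly the daily amounts add up to less than the total (10000 over 3 days gives 3333 per day, 9999 in all). If that matters, one possible approach is to give the first few days one extra unit each. The sketch below uses hypothetical names (AllocationSketch, allocate) and assumes a non-negative total; it is not part of the original script.

```java
import java.util.Arrays;

// Hypothetical helper showing one way to keep the daily amounts summing to the total.
final class AllocationSketch {

    // Splits total into `days` parts that differ by at most 1 and sum to total.
    // Assumes total >= 0.
    static long[] allocate(long total, int days) {
        if (days <= 0) {
            throw new IllegalArgumentException("days must be positive");
        }
        long base = total / days;       // what the Pig script assigns to every day
        long remainder = total % days;  // units lost by the integer division
        long[] amounts = new long[days];
        for (int i = 0; i < days; i++) {
            // The first `remainder` days receive one extra unit each.
            amounts[i] = base + (i < remainder ? 1 : 0);
        }
        return amounts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(allocate(10000, 3)));
    }
}
```

When the total divides evenly, as in both examples above, this gives the same amounts as the script.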