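I need a Pig script that turns a single row containing a campaign ID, start date, end date, and amount into multiple rows: one row per day, each carrying the share of the amount allocated to that day. For example, the schema is: campaignId,startDate,endDate,totalAmount
My input row is: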
1,2015-01-01,2015-01-10,10000
I need to create a separate row for each day of this campaign, with totalAmount divided evenly across the days, in this schema:
campaignId,date,amount
1,2015-01-01,1000
1,2015-01-02,1000
1,2015-01-03,1000
... and so on, one row for each day of the campaign.
I was hoping I could use a nested foreach and the DaysBetween function.
Answer 0 (score: 1)
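This is somewhat difficult to solve with standard Pig. The challenge is generating the dates between the two endpoints dynamically. Suppose the range spans a month boundary (for example, 2015-01-28 to 2015-02-06): Pig has no built-in way to work out that it must generate 4 days from January and 6 days from February.
One option is to move the date generation into a custom UDF that parses the input and generates the intermediate dates.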
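To illustrate the date generation step on its own, here is a minimal sketch in plain Java, independent of Pig. The class name `DateRangeSketch` and the method `expand` are hypothetical names chosen for this illustration, and it uses `java.time` rather than the Joda-Time classes the UDF relies on. It only aims to show that adding days to a start date handles the month boundary without any special casing.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of the UDF.
class DateRangeSketch {

    // Returns every date from start to end (both inclusive) as ISO strings.
    static List<String> expand(String start, String end) {
        LocalDate first = LocalDate.parse(start);
        LocalDate last = LocalDate.parse(end);
        // Inclusive day count, the same "+1" the Pig script applies to DaysBetween
        long days = ChronoUnit.DAYS.between(first, last) + 1;
        List<String> dates = new ArrayList<>();
        for (long i = 0; i < days; i++) {
            // plusDays rolls over from January into February automatically
            dates.add(first.plusDays(i).toString());
        }
        return dates;
    }

    public static void main(String[] args) {
        for (String d : expand("2015-01-28", "2015-02-06")) {
            System.out.println(d);
        }
    }
}
```

For the range 2015-01-28 to 2015-02-06 this yields ten dates, four in January followed by six in February.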
Example 1: a single input row whose dates do not span a month boundary
Input:
1,2015-01-01,2015-01-10,10000
Pig script:
REGISTER PARSEDATE.jar;
A = LOAD 'input' Using PigStorage(',') AS (campaignId:int,startDate,endDate,totalAmount:int);
B = FOREACH A GENERATE campaignId,DaysBetween((datetime)endDate,(datetime)startDate)+1 AS cnt, totalAmount,TOBAG(*) AS mybag;
C = FOREACH B GENERATE campaignId,FLATTEN(TOKENIZE(mypackage.PARSEDATE(BagToString(mybag)),'#')),(int)(totalAmount/cnt) AS totalAmount;
STORE C INTO 'output' USING PigStorage(',');
Output:
1,2015-01-01,1000
1,2015-01-02,1000
1,2015-01-03,1000
1,2015-01-04,1000
1,2015-01-05,1000
1,2015-01-06,1000
1,2015-01-07,1000
1,2015-01-08,1000
1,2015-01-09,1000
1,2015-01-10,1000
Example 2: two input rows; the first does not span a month boundary, the second does
Input1:
1,2015-01-01,2015-01-10,10000
2,2015-01-28,2015-02-06,10000
Pig script:
REGISTER PARSEDATE.jar;
A = LOAD 'input1' Using PigStorage(',') AS (campaignId:int,startDate,endDate,totalAmount:int);
B = FOREACH A GENERATE campaignId,DaysBetween((datetime)endDate,(datetime)startDate)+1 AS cnt, totalAmount,TOBAG(*) AS mybag;
C = FOREACH B GENERATE campaignId,FLATTEN(TOKENIZE(mypackage.PARSEDATE(BagToString(mybag)),'#')),(int)(totalAmount/cnt) AS totalAmount;
STORE C INTO 'output1' USING PigStorage(',');
Output:
1,2015-01-01,1000
1,2015-01-02,1000
1,2015-01-03,1000
1,2015-01-04,1000
1,2015-01-05,1000
1,2015-01-06,1000
1,2015-01-07,1000
1,2015-01-08,1000
1,2015-01-09,1000
1,2015-01-10,1000
2,2015-01-28,1000
2,2015-01-29,1000
2,2015-01-30,1000
2,2015-01-31,1000
2,2015-02-01,1000
2,2015-02-02,1000
2,2015-02-03,1000
2,2015-02-04,1000
2,2015-02-05,1000
2,2015-02-06,1000
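You need to compile the following Java code, build PARSEDATE.jar from it, and register the jar in your Pig script. I wrote this code quickly, so feel free to optimize it as needed.
PARSEDATE.java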
package mypackage;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.joda.time.Days;
import org.joda.time.LocalDate;

public class PARSEDATE extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for missing input rather than failing the whole job
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        // The input is the BagToString output, with fields joined by '_'
        String inputString = (String) input.get(0);
        String[] fields = inputString.split("_");
        // Start date is the second field, end date is the third
        LocalDate st = new LocalDate(fields[1]);
        LocalDate et = new LocalDate(fields[2]);
        // Number of days in the range, counting both endpoints
        int days = Days.daysBetween(st, et).getDays() + 1;
        // Prefix each date with '#' so the Pig script can split them with TOKENIZE
        StringBuilder output = new StringBuilder();
        for (int index = 0; index < days; index++) {
            output.append('#').append(st.plusDays(index));
        }
        return output.toString();
    }
}
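One thing to keep in mind: the Pig scripts above compute the daily amount as (int)(totalAmount/cnt), which is integer division. In the examples 10000 divides evenly by 10, but for a total that does not divide evenly the daily rows would sum to less than totalAmount. The following is a minimal sketch of one possible way to spread the remainder, assuming non-negative integer amounts; the class `AmountSplitSketch` and the method `split` are hypothetical names and are not part of the answer's UDF.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; assumes total >= 0.
class AmountSplitSketch {

    // Splits total into `days` integer parts that sum exactly to total.
    // The first (total % days) days each receive one extra unit.
    static int[] split(int total, int days) {
        if (days <= 0) {
            throw new IllegalArgumentException("days must be positive");
        }
        int base = total / days;   // what plain integer division would give every day
        int extra = total % days;  // units that integer division would drop
        int[] parts = new int[days];
        for (int i = 0; i < days; i++) {
            parts[i] = base + (i < extra ? 1 : 0);
        }
        return parts;
    }

    public static void main(String[] args) {
        // 10000 over 3 days: prints [3334, 3333, 3333]
        System.out.println(Arrays.toString(split(10000, 3)));
    }
}
```

Whether the leftover units should go to the first days, the last day, or be kept as fractional amounts is a business decision; this sketch simply picks the first option.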