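I want to implement the following in Pig. I have a set of sample records like this.
Note that the EffectiveDate column is sometimes blank, and that it can differ between records with the same CustomerID.
As output, I want one record per CustomerID: the one whose EffectiveDate is the maximum. So for the example above, I want the highlighted records shown below.
This is how I currently do it in Pig: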
customerdata = LOAD 'customerdata' AS (CustomerID:chararray, CustomerName:chararray, Age:int, Gender:chararray, EffectiveDate:chararray);
--Group customer data by CustomerID
customerdata_grpd = GROUP customerdata BY CustomerID;
--From the grouped data, generate one record per CustomerID that has the maximum EffectiveDate.
customerdata_maxdate = FOREACH customerdata_grpd GENERATE group as CustID, MAX(customerdata.EffectiveDate) as MaxDate;
--Join the above with the original data so that we get the other details like CustomerName, Age etc.
joinwithoriginal = JOIN customerdata by (CustomerID, EffectiveDate), customerdata_maxdate by (CustID, MaxDate);
finaloutput = FOREACH joinwithoriginal GENERATE customerdata::CustomerID as CustomerID, CustomerName as CustomerName, Age as Age, Gender as gender, EffectiveDate as EffectiveDate;
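Basically, I group the original data to find the maximum EffectiveDate per customer. Then I join these "grouped" records back to the original dataset to get the same records with the maximum EffectiveDate, this time together with the other fields such as CustomerName, Age and Gender. The dataset is very large, so this approach takes a lot of time. Is there a better way?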
Answer 0: (score: 4)
Pig script:
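-- Load the comma-separated customer records; effective_date is still a string here.
customer_data = LOAD 'customer_data.csv' USING PigStorage(',') AS (id:int,name:chararray,age:int,gender:chararray,effective_date:chararray);
-- Parse effective_date into a datetime so records sort chronologically; keep the original string for the output.
customer_data_fmt = FOREACH customer_data GENERATE id..gender, ToDate(effective_date,'dd-MMM-yy') AS date, effective_date;
-- Collect all records of one customer into a single group.
customer_data_grp_id = GROUP customer_data_fmt BY id;
-- Within each group, sort newest first and keep only the top record, so no join is needed.
req_data = FOREACH customer_data_grp_id {
    customer_data_ordered = ORDER customer_data_fmt BY date DESC;
    req_customer_data = LIMIT customer_data_ordered 1;
    GENERATE FLATTEN(req_customer_data.id) AS id,
             FLATTEN(req_customer_data.name) AS name,
             FLATTEN(req_customer_data.gender) AS gender,
             FLATTEN(req_customer_data.effective_date) AS effective_date;
};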
Output:
(1,John,M,1-Feb-15)
(2,Jane,F,5-Jun-15)
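The nested FOREACH in this answer picks the newest record inside each group, which is what avoids the join back to the original data. As a minimal sketch, assuming the same relation and field names as in the script, the four separate FLATTEN calls could also be written as one FLATTEN over a multi-column projection of the single-row bag. The aliases newest_first and newest are only illustrative, and this variant has not been run against the data in the question.

```pig
-- Sketch only: reuses customer_data_grp_id and customer_data_fmt from the script in this answer.
req_data = FOREACH customer_data_grp_id {
    -- Sort each customer's records with the latest date first.
    newest_first = ORDER customer_data_fmt BY date DESC;
    -- Keep just the top record of the group.
    newest = LIMIT newest_first 1;
    -- Project the wanted columns from the one-row bag and flatten it into a single tuple.
    GENERATE FLATTEN(newest.(id, name, gender, effective_date));
};
```

Adding age to the projection list would carry that field through as well, if the full record is wanted.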