Date-based data cleaning in Pig

Time: 2018-04-14 04:19:11

Tags: apache-pig data-cleaning

I have 2 datasets, as follows:
1. ID and location

{ID, beginning year, ending year, location}. 

Sample:

(1001, 2010, 2012, CA)
(1001, 2013, 2015, WA)
(1002, 2009, 2015, AZ)
(1003, 2014, 2015, FL)

2. ID and connection

{ID1, ID2, connection creating date}

Sample:

(1001, 1002, 2013)
(1001, 1003, 2014)

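I want to count the number of connections by location and year. I assume that once a connection is created, it never expires. The result I am looking for is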

{Location 1, Location2, year, number of connections}

For the sample above, it should be:

(WA, AZ, 2013, 1)
(WA, AZ, 2014, 1)
(WA, AZ, 2015, 1)
(WA, FL, 2014, 1)
(WA, FL, 2015, 1)

Does anyone know how to achieve this in Apache Pig?

1 answer:

Answer 0: (score: 1)

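As your comment says, at some point we need to expand the data to per-year information. To minimize the impact of the resulting data-size blowup, we should push that step as late as possible in the Pig script. The first thing we need is the following data transformation: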

{ID1, ID2, connection creating date} -> {Location1, Location2, start_year, end_year}

This can be achieved with the following Pig statements:

locationData = LOAD 'path1' USING PigStorage('\t') AS (ID:chararray, beginning_year:long, ending_year:long, location:chararray);
connectionData = LOAD 'path2' USING PigStorage('\t') AS (ID1:chararray, ID2:chararray, connection_year:long);

partialJoin = JOIN connectionData BY ID1, locationData BY ID;
partialExtracted = FOREACH partialJoin GENERATE
                           ID2,
                           connection_year,
                           location AS location1,
                           (beginning_year > connection_year ? beginning_year : connection_year) AS start_year,
                           ending_year AS end_year;

fullJoin = JOIN partialExtracted BY ID2, locationData BY ID;
fullExtracted = FOREACH fullJoin GENERATE
                           location1,
                           location AS location2,
                           (beginning_year > start_year ? beginning_year : start_year) AS start_year,
                           (ending_year < end_year ? ending_year : end_year ) AS end_year;

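-- Keep only rows with a non-empty year range (both locations overlap after the connection exists).
fullFiltered = FILTER fullExtracted BY (end_year >= start_year);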

We are now ready to explode the data into per-year rows. Essentially, the following data transformation is needed:

{Location1, Location2, start_year, end_year} -> {Location1, Location2, year}
e.g.
WA, AZ, 2013, 2015
->
WA, AZ, 2013
WA, AZ, 2014
WA, AZ, 2015

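A UDF is unavoidable here. We need a UDF that takes a start year and an end year and returns a bag containing every year in that range. You should be able to write the UDF by following an online tutorial. Let's say this UDF is called getYearRange(). Your script will then look like this: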

fullExploded = FOREACH fullFiltered GENERATE
                       location1, location2,
                       FLATTEN(getYearRange(start_year, end_year)) AS year;
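For illustration, getYearRange() could be sketched as a Python (Jython) UDF along these lines. This is a minimal sketch, not a tested production UDF: the schema string, the null handling, and the fallback decorator (added only so the file also loads outside Pig) are assumptions rather than part of the original answer.

```python
# Sketch of a Pig UDF in Python (Jython). Pig's Jython engine is expected to
# supply the outputSchema decorator; the fallback below is an assumption added
# so that the file also loads in a plain Python interpreter.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        # No-op stand-in: returns the decorated function unchanged.
        def wrap(func):
            return func
        return wrap


@outputSchema("years:bag{t:tuple(year:long)}")
def getYearRange(start_year, end_year):
    """Return one single-field tuple per year in [start_year, end_year]."""
    # Pig passes null fields as None; return an empty bag in that case.
    if start_year is None or end_year is None:
        return []
    # An inverted range (end before start) naturally yields an empty bag.
    return [(year,) for year in range(int(start_year), int(end_year) + 1)]
```

If the file is registered with something like `REGISTER 'year_udfs.py' USING jython AS yearUdfs;` (the file name and alias are hypothetical), the call in the script would be written as `yearUdfs.getYearRange(start_year, end_year)`.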

All that is left is a GROUP BY to get your final counts:

fullGrouped = GROUP fullExploded BY (location1, location2, year);
finalOutput = FOREACH fullGrouped GENERATE 
              FLATTEN(group) AS (location1, location2, year),
              COUNT(fullExploded) AS count;

The above describes the data flow. You may need to add extra steps to handle edge cases and make sure the data is sane.
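As a sanity check, the same data flow can be replayed in memory on the sample rows from the question. The sketch below is plain Python rather than Pig, and the function name count_connections is made up for illustration. It joins each connection to both users' location rows, clips the year window, skips windows whose end year falls before the start year, expands the rest into single years, and counts.

```python
from collections import Counter

# Sample rows copied from the question.
locations = [
    ("1001", 2010, 2012, "CA"),
    ("1001", 2013, 2015, "WA"),
    ("1002", 2009, 2015, "AZ"),
    ("1003", 2014, 2015, "FL"),
]
connections = [("1001", "1002", 2013), ("1001", "1003", 2014)]


def count_connections(locations, connections):
    """Replay join -> clip -> filter -> explode -> group/count in memory."""
    counts = Counter()
    for id1, id2, connection_year in connections:
        for loc_id1, begin1, end1, location1 in locations:
            if loc_id1 != id1:
                continue
            for loc_id2, begin2, end2, location2 in locations:
                if loc_id2 != id2:
                    continue
                # Clip to the window where both stays overlap and the
                # connection already exists.
                start_year = max(connection_year, begin1, begin2)
                end_year = min(end1, end2)
                # An empty window (end before start) contributes no years.
                for year in range(start_year, end_year + 1):
                    counts[(location1, location2, year)] += 1
    return sorted(key + (n,) for key, n in counts.items())


result = count_connections(locations, connections)
for row in result:
    print(row)
```

On the sample data this produces the five rows listed in the question: WA/AZ for 2013 through 2015 and WA/FL for 2014 and 2015, each with a count of 1. The CA rows drop out because user 1001 had already left CA before either connection was created.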