I have 2 datasets, as shown below:
1. ID and location
{ID, beginning year, ending year, location}.
Sample:
(1001, 2010, 2012, CA)
(1001, 2013, 2015, WA)
(1002, 2009, 2015, AZ)
(1003, 2014, 2015, FL)
2. ID and connection
{ID1, ID2, connection creating date}
Sample:
(1001, 1002, 2013)
(1001, 1003, 2014)
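I want to count the number of connections by location and year. I assume that once a connection is created, it never expires. The result I am looking for is: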
{Location 1, Location2, year, number of connections}
In the example above, it should be:
(WA, AZ, 2013, 1)
(WA, AZ, 2014, 1)
(WA, AZ, 2015, 1)
(WA, FL, 2014, 1)
(WA, FL, 2015, 1)
Does anyone know how to achieve this in Apache Pig?
Answer 0 (score: 1)
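As your comment notes, at some point we have to expand the data into per-year records. To minimize the impact of that data size blow-up, we should push the expansion as late as possible in the Pig script. The first thing we need is the following data transformation: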
{ID1, ID2, connection creating date} -> {Location1, Location2, start_year, end_year}
This can be achieved with the following Pig script statements:
-- Load both datasets (tab-delimited input).
locationData = LOAD 'path1' USING PigStorage('\t') AS (ID:chararray, beginning_year:long, ending_year:long, location:chararray);
connectionData = LOAD 'path2' USING PigStorage('\t') AS (ID1:chararray, ID2:chararray, connection_year:long);
-- Attach ID1's locations; the interval cannot start before the connection exists.
partialJoin = JOIN connectionData BY ID1, locationData BY ID;
partialExtracted = FOREACH partialJoin GENERATE
    ID2,
    connection_year,
    location AS location1,
    (beginning_year > connection_year ? beginning_year : connection_year) AS start_year,
    ending_year AS end_year;
-- Attach ID2's locations and intersect the two residence intervals.
fullJoin = JOIN partialExtracted BY ID2, locationData BY ID;
fullExtracted = FOREACH fullJoin GENERATE
    location1,
    location AS location2,
    (beginning_year > start_year ? beginning_year : start_year) AS start_year,
    (ending_year < end_year ? ending_year : end_year) AS end_year;
-- Keep only non-empty intervals (the start must not fall after the end).
fullFiltered = FILTER fullExtracted BY (start_year <= end_year);
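To sanity-check the interval logic outside Pig, here is a minimal Python sketch of the same two joins, clipping, and filter, run on the sample rows from the question. The function name `connection_intervals` is made up for illustration; it is not part of the Pig script.

```python
# Sample rows from the question.
locations = [
    ("1001", 2010, 2012, "CA"),
    ("1001", 2013, 2015, "WA"),
    ("1002", 2009, 2015, "AZ"),
    ("1003", 2014, 2015, "FL"),
]
connections = [("1001", "1002", 2013), ("1001", "1003", 2014)]


def connection_intervals(connections, locations):
    """Return (location1, location2, start_year, end_year) for each overlap.

    Hypothetical helper mirroring the two JOINs, the clipping, and the FILTER.
    """
    result = []
    for id1, id2, connection_year in connections:
        for loc_id1, begin1, end1, location1 in locations:
            if loc_id1 != id1:
                continue
            for loc_id2, begin2, end2, location2 in locations:
                if loc_id2 != id2:
                    continue
                # The interval starts once the connection exists and both
                # people live at the given locations; it ends when either moves.
                start_year = max(connection_year, begin1, begin2)
                end_year = min(end1, end2)
                if start_year <= end_year:  # drop empty intervals
                    result.append((location1, location2, start_year, end_year))
    return result


print(connection_intervals(connections, locations))
# [('WA', 'AZ', 2013, 2015), ('WA', 'FL', 2014, 2015)]
```

The CA rows for ID 1001 drop out because that residence ended in 2012, before either connection was created.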
We are now ready to explode the data to get per-year information. Essentially, the following data transformation is needed:
{Location1, Location2, start_year, end_year} -> {Location1, Location2, year}
e.g.
WA, AZ, 2013, 2015
->
WA, AZ, 2013
WA, AZ, 2014
WA, AZ, 2015
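A UDF is needed here: one that takes a start year and an end year and returns a bag containing every year in that range. You should be able to write it by following an online UDF tutorial. Let's say this UDF is called getYearRange(). Your script would then look like this: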
fullExploded = FOREACH fullFiltered GENERATE
location1, location2,
FLATTEN(getYearRange(start_year, end_year)) AS year;
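One possible shape for getYearRange() is a Python UDF run through Pig's Jython support. This is only a sketch under assumptions: the schema string, the inclusive end year, and the null handling are my choices, and the int/long type mapping should be checked against your Pig version. The fallback decorator exists only so the file can also be run under plain Python for testing.

```python
try:
    outputSchema  # expected to be provided by Pig's Jython script engine
except NameError:
    # Fallback so the file also runs under plain Python for testing.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


# Assumed schema: a bag of single-field tuples, one per year.
@outputSchema("years:bag{t:tuple(year:long)}")
def getYearRange(start_year, end_year):
    """Return one single-field tuple per year, start_year to end_year inclusive."""
    if start_year is None or end_year is None:
        return []  # assumption: null input yields an empty bag
    return [(year,) for year in range(int(start_year), int(end_year) + 1)]
```

If saved as, say, year_range.py (a hypothetical file name), it would be registered with something like REGISTER 'year_range.py' USING jython AS udfs; and then called as udfs.getYearRange(start_year, end_year).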
All that remains is a GROUP BY to get your final counts:
fullGrouped = GROUP fullExploded BY (location1, location2, year);
finalOutput = FOREACH fullGrouped GENERATE
FLATTEN(group) AS (location1, location2, year),
COUNT(fullExploded) AS count;
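As a rough check of the explode-then-count step, the Python sketch below expands the intervals that the question's sample data yields, (WA, AZ, 2013, 2015) and (WA, FL, 2014, 2015), and counts rows per (location1, location2, year). The name `count_connections` is hypothetical and only illustrates the logic of the FLATTEN plus GROUP BY.

```python
from collections import Counter

# Intervals derived by hand from the question's sample data.
intervals = [("WA", "AZ", 2013, 2015), ("WA", "FL", 2014, 2015)]


def count_connections(intervals):
    """Explode each interval into years, then count per (loc1, loc2, year)."""
    counts = Counter()
    for location1, location2, start_year, end_year in intervals:
        for year in range(start_year, end_year + 1):  # end year inclusive
            counts[(location1, location2, year)] += 1
    return sorted((l1, l2, y, n) for (l1, l2, y), n in counts.items())


for row in count_connections(intervals):
    print(row)
# ('WA', 'AZ', 2013, 1)
# ('WA', 'AZ', 2014, 1)
# ('WA', 'AZ', 2015, 1)
# ('WA', 'FL', 2014, 1)
# ('WA', 'FL', 2015, 1)
```

These rows match the result the question asks for.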
The above describes the data flow. You may need to add further steps to handle edge cases and keep the data sane.