Question

我在猪脚本中有以下关系：

my_relation: {entityId: chararray,attributeName: chararray,bytearray}

(++JIYMIS2D,timeseries,([value#50.0,timestamp#1388675231000]))
(++JRGOCZQD,timeseries,([value#50.0,timestamp#1388592317000],[value#25.0,timestamp#1388682237000]))
(++GCYI1OO4,timeseries,())
(++JYY0LOTU,timeseries,())

bytearray列中可以有任意数量的值/时间戳对（甚至为零）。

我想将此关系转换为此关系（每个entityId，attributeName，value，timestamp四元组为一行）：

++JIYMIS2D,timeseries,50.0,1388675231000
++JRGOCZQD,timeseries,50.0,1388592317000
++JRGOCZQD,timeseries,25.0,1388682237000
++GCYI1OO4,timeseries,,
++JYY0LOTU,timeseries,,

或者这也没关系 - 我对没有值/时间戳

的行不感兴趣

++JIYMIS2D,timeseries,50.0,1388675231000
++JRGOCZQD,timeseries,50.0,1388592317000
++JRGOCZQD,timeseries,25.0,1388682237000

有什么想法吗？基本上我想在bytearray列中规范化地图元组，以便模式如下：

my_relation: {entityId: chararray,
              attributeName: chararray, 
              value: float, 
              timestamp: int}

我是猪初学者，如果这很明显，那就很抱歉！我需要一个UDF才能这样做吗？

这个问题很相似，但到目前为止还没有答案：How do I split in Pig a tuple of many maps into different rows

我正在运行Apache Pig版本0.12.0-cdh5.1.2

编辑 - 添加到目前为止我所做的事情的详细信息。

这是一个猪脚本片段，输出如下：

-- StateVectorFileStorage is a LoadStoreFunc and AttributeData is a UDF, both java. 
ts_to_average = LOAD 'StateVector' USING StateVectorFileStorage();
ts_to_average = LIMIT ts_to_average 10;
ts_to_average = FOREACH ts_to_average GENERATE entityId, FLATTEN(AttributeData(*));
a = FOREACH ts_to_average GENERATE entityId, $1 as attributeName:chararray, $2#'value';
b = foreach a generate entityId, attributeName, FLATTEN($2);

c_no_flatten = foreach b generate
  $0 as entityId,
  $1 as attributeName,
  TOBAG($2 ..);

c = foreach b generate
  $0 as entityId,
  $1 as attributeName,
  FLATTEN(TOBAG($2 ..));

d = foreach c generate
  entityId,
  attributeName,
  (float)$2#'value' as value,
  (int)$2#'timestamp' as timestamp;

dump a;
describe a;
dump b;
describe b;
dump c_no_flatten;
describe c_no_flatten;
dump c;
describe c;
dump d;
describe d;

输出如下。注意在关系'c'中，第二个值/时间戳对[值＃52.0，时间戳＃1388683516000]是如何丢失的。

(++JIYMIS2D,RechargeTimeSeries,([value#50.0,timestamp#1388675231000],[value#52.0,timestamp#1388683516000]))
(++JRGOCZQD,RechargeTimeSeries,([value#50.0,timestamp#1388592317000]))
(++GCYI1OO4,RechargeTimeSeries,())
a: {entityId: chararray,attributeName: chararray,bytearray}

(++JIYMIS2D,RechargeTimeSeries,[value#50.0,timestamp#1388675231000],[value#52.0,timestamp#1388683516000])
(++JRGOCZQD,RechargeTimeSeries,[value#50.0,timestamp#1388592317000]))
(++GCYI1OO4,RechargeTimeSeries)
b: {entityId: chararray,attributeName: chararray,bytearray}

(++JIYMIS2D,RechargeTimeSeries,{([value#50.0,timestamp#1388675231000])})
(++JRGOCZQD,RechargeTimeSeries,{([value#50.0,timestamp#1388592317000])})
(++GCYI1OO4,RechargeTimeSeries,{()})
c_no_flatten: {entityId: chararray,attributeName: chararray,{(bytearray)}}

(++JIYMIS2D,RechargeTimeSeries,[value#50.0,timestamp#1388675231000])
(++JRGOCZQD,RechargeTimeSeries,[value#50.0,timestamp#1388592317000])
(++GCYI1OO4,RechargeTimeSeries,)
c: {entityId: chararray,attributeName: chararray,bytearray}

(++JIYMIS2D,RechargeTimeSeries,50.0,1388675231000)
(++JRGOCZQD,RechargeTimeSeries,50.0,1388592317000)
(++GCYI1OO4,RechargeTimeSeries,,)
d: {entityId: chararray,attributeName: chararray,value: float,timestamp: int}

Answer 1

这应该可以解决问题。首先，展平地图的元组以摆脱封装元组：

b = foreach a generate entityId, attributeName, FLATTEN($2);

现在我们可以将前两个字段中的所有字段转换为包。可以展平行李（请参阅http://pig.apache.org/docs/r0.12.0/basic.html#flatten）以获取每个值/时间戳对的行：

c = foreach b generate
  $0 as entityId,
  $1 as attributeName,
  FLATTEN(TOBAG($2 ..));

最后，从地图中获取所需的值：

d = foreach c generate
  entityId,
  attributeName,
  (float)$2#'value' as value,
  (int)$2#'timestamp' as timestamp;

<强>更新从地图元组中制作一袋地图的其他一些选项：

DataFu的TransposeTupleToBag：http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/util/TransposeTupleToBag.html
此答案中的foo() Python UDF：Pig - how to iterate on a bag of maps

如何规范化apache猪中的地图元组？

1 个答案: