Question

I'm trying to get a simple PigActivity working in Data Pipeline. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-pigactivity.html#pigactivity

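This activity requires input and output fields. I have both set to use an S3DataNode. Each DataNode has a directoryPath pointing to my S3 input and output locations. I initially tried using filePath but got the following error: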

PigActivity requires 'directoryPath' in 'Output' object.

I'm using a custom Pig script, which is also located in S3.

My question is: how do I reference these input and output paths in the script?

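The example given in the reference uses the stage field (which can be enabled or disabled). My understanding is that staging is used to convert the data into tables. I don't want to do that, because it also requires you to specify the dataFormat field. The documentation describes the stage field as follows: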

Determines whether staging is enabled and allows your Pig script to have access to the staged-data tables, such as ${INPUT1} and ${OUTPUT1}.

I have disabled staging, and I'm trying to access the data in my script like this:

input = LOAD '$Input';

But I get the following error:

IOException. org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : Input

I also tried:

 input = LOAD '${Input}';

But that also gave me an error.

There is an optional scriptVariable field. Do I have to use some kind of mapping there?

Answer 1

Just using

LOAD 'uri to your s3'

should work.

Normally this is done for you by staging (table creation): you don't access the URI directly from your script, you only specify it in the S3DataNode.
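As a rough sketch of what that could look like with staging disabled: the bucket name and prefixes below are made up for illustration, so substitute the same locations you set as directoryPath on your S3DataNodes. The relation name raw and the comma delimiter are also just assumptions about your data.

-- Hypothetical S3 locations; replace with your own directoryPath values.
-- Load every file under the input prefix as comma-separated text.
raw = LOAD 's3://example-bucket/pig/input/' USING PigStorage(',');

-- Write the result under the output prefix.
-- Pig expects the output directory not to exist yet.
STORE raw INTO 's3://example-bucket/pig/output/' USING PigStorage(',');

The downside is that the paths end up hardcoded in both the script and the pipeline definition, so you have to keep the two in sync yourself.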

Answer 2

Make sure you have set the "stage" property of the PigActivity to true.

Once I did that, the script below started working for me:

part  = LOAD ${input1} USING PigStorage(',') AS (p_partkey,p_name,p_mfgr,p_category,p_brand1,p_color,p_type,p_size,p_container);
grpd = GROUP part BY p_color;
${output1} = FOREACH grpd GENERATE group, COUNT(part);

Using AWS Data Pipeline PigActivity
