Running a Pig script on EMR

Time: 2015-02-27 02:08:54

Tags: hadoop apache-pig emr amazon-emr

I am using the following file as input: https://svn.apache.org/repos/asf/pig/trunk/tutorial/data/excite-small.log

My current code is:

-- FileName: excite-small.log
log  = LOAD 'excite-small.log' AS (user, timestamp, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
STORE cntd INTO 'output';

I am running this job on EMR by following the steps described at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-pig-launch.html

**I have set the following parameters**

1. For Script Location: s3://mybucket/test.pig
2. For Input Location:  s3://mybucket/excite-small.log
3. For Output Location: s3://mybucket/
4. Arguments: Blank

When I run this job, I get the error "Input path does not exist". I think it has something to do with REGISTER, but I'm not sure. Can anyone tell me what I'm doing wrong?

1 Answer:

Answer 0: (score: 2)

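A relative path such as 'excite-small.log' is resolved against the cluster's default file system (HDFS), not against your S3 bucket, which is why the input is not found. In your Pig script, refer to the input file by its full path, for example: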

log  = LOAD 's3://mybucket/excite-small.log' AS (user, timestamp, query);

Alternatively, use the INPUT path that is passed in to the script:

log = LOAD '$INPUT' AS (user, timestamp, query);
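Putting the two ideas together, here is a minimal sketch of the whole script with both locations parameterized. It assumes the EMR Pig step hands the Input Location and Output Location fields to the script as the parameters INPUT and OUTPUT, as the `$INPUT` line above relies on. It also assumes the Output Location is changed to a prefix that does not exist yet (the name `s3://mybucket/output/` is only an illustration), since Hadoop normally refuses to write into an existing output directory.

```pig
-- test.pig: count the number of log records per user.
-- Assumption: the EMR step supplies INPUT and OUTPUT as Pig parameters,
-- e.g. INPUT=s3://mybucket/excite-small.log and
-- OUTPUT=s3://mybucket/output/ (hypothetical prefix that must not exist yet).
log  = LOAD '$INPUT' AS (user, timestamp, query);

-- Group all records that belong to the same user.
grpd = GROUP log BY user;

-- Emit one (user, record count) pair per group.
cntd = FOREACH grpd GENERATE group, COUNT(log);

-- Write the result under the output location passed in by the step.
STORE cntd INTO '$OUTPUT';
```

To try the same script outside EMR, the parameters can be supplied by hand with Pig's `-p` option, for example `pig -p INPUT=excite-small.log -p OUTPUT=output test.pig`.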

I found a good explanation here: