Pig DUMP gets stuck in GROUP

Time: 2012-06-01 00:50:12

Tags: hadoop apache-pig

I'm a Pig beginner (using Pig 0.10.0) and I have some simple JSON like the following:

test.json:

{
  "from": "1234567890",
  .....
  "profile": {
      "email": "me@domain.com"
      .....
  }
}

I'm doing some grouping/counting in Pig, started in local mode:

>pig -x local

with the following Pig script:

REGISTER /pig-udfs/oink.jar;
REGISTER /pig-udfs/json-simple-1.1.jar;
REGISTER /pig-udfs/guava-12.0.jar;
REGISTER /pig-udfs/elephant-bird-2.2.3.jar;

users = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') as (json:map[]);

domain_user = FOREACH users GENERATE oink.EmailDomainFilter(json#'profile'#'email') as email, json#'from' as user_id;
DUMP domain_user; /* Outputs: (domain.com,1234567890) */

grouped_domain_user = GROUP domain_user BY email;
DUMP grouped_domain_user; /* Outputs: =stuck here= */

Basically, when I try to dump grouped_domain_user, Pig gets stuck, apparently waiting for a map output to complete:

2012-05-31 17:45:22,111 [Thread-15] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local_0002_m_000000_0' done.
2012-05-31 17:45:22,119 [Thread-15] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorPlugin : null
2012-05-31 17:45:22,123 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - ShuffleRamManager: MemoryLimit=724828160, MaxSingleShuffleLimit=181207040
2012-05-31 17:45:22,125 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging on-disk files
2012-05-31 17:45:22,128 [Thread for merging in memory files] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging in memory files
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-05-31 17:45:22,129 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
2012-05-31 17:45:22,129 [Thread for polling Map Completion Events] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-05-31 17:45:22,129 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-05-31 17:45:28,118 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:31,122 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:37,123 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:43,124 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:46,124 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:52,126 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:58,127 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:46:01,128 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
.... repeats ....

Any suggestions as to why this is happening are welcome.

Thanks!

Update

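Chris solved this for me. I had set fs.default.name etc. to the correct values in pig.properties, but I also had the HADOOP_CONF_DIR environment variable pointing to my local Hadoop installation, where these same properties were marked as final.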
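
One way to check for this situation is to scan the site files under HADOOP_CONF_DIR for properties marked final. The following is a minimal sketch, not part of Pig or Hadoop: the helper names `final_properties` and `scan_conf_dir` are hypothetical, and it assumes the usual `<configuration>/<property>` layout with `core-site.xml` and `mapred-site.xml` sitting directly in that directory.

```python
import os
import xml.etree.ElementTree as ET


def final_properties(root):
    """Return {name: value} for every <property> whose <final> is 'true'."""
    result = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        final = (prop.findtext("final") or "").strip().lower()
        if name and final == "true":
            result[name] = value
    return result


def scan_conf_dir(conf_dir, filenames=("core-site.xml", "mapred-site.xml")):
    """Collect final properties from the site files that exist in conf_dir."""
    found = {}
    for filename in filenames:
        path = os.path.join(conf_dir, filename)
        if os.path.isfile(path):
            found[filename] = final_properties(ET.parse(path).getroot())
    return found


if __name__ == "__main__":
    conf_dir = os.environ.get("HADOOP_CONF_DIR")
    if conf_dir is None:
        print("HADOOP_CONF_DIR is not set")
    else:
        for filename, props in scan_conf_dir(conf_dir).items():
            for name, value in props.items():
                print(f"{filename}: {name} = {value} (final)")
```

If fs.default.name or mapred.job.tracker shows up in the output, the values in pig.properties are likely being ignored for that property.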

Great find, many thanks.

1 answer:

Answer 0 (score: 3)

Posting this so the question can be marked as answered, and for anyone who runs into the same problem in the future:

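When running in local mode (whether running Pig via pig -x local or submitting a MapReduce job to the local job runner), if you see the reduce phase 'hang', especially when you see entries in the log similar to: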

2012-05-31 17:45:22,129 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - 
      attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress

then your job, although started in local mode, has probably switched to 'cluster' mode because the mapred.job.tracker property is marked as 'final' in $HADOOP/conf/mapred-site.xml:

<property>
    <name>mapred.job.tracker</name>
    <value>hdfs://localhost:9000</value>
    <final>true</final>
</property>

You should also check the fs.default.name property in core-site.xml and make sure it is not marked as final.

Marking a property as final means you cannot set its value at runtime, and you may even see warning messages similar to the following:

12/05/22 14:28:29 WARN conf.Configuration: 
    file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: fs.default.name;  Ignoring.
12/05/22 14:28:29 WARN conf.Configuration: 
    file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker;  Ignoring.
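
To illustrate why the runtime value is ignored, here is a toy model of layered configuration with final properties. It is a sketch under stated assumptions, not Hadoop's actual Configuration code: the function `merge_resources` and the tuple layout are hypothetical, and the warning text only imitates the log lines above.

```python
def merge_resources(resources):
    """Merge config resources in load order, honouring 'final' markers.

    resources: list of (source_name, [(name, value, is_final), ...]).
    Returns (values, warnings). Once a property has been loaded as final,
    later resources cannot change it and a warning is recorded instead.
    """
    values = {}
    finals = set()
    warnings = []
    for source, props in resources:
        for name, value, is_final in props:
            if name in finals:
                warnings.append(
                    f"{source}: attempt to override final parameter: {name}; Ignoring."
                )
                continue
            values[name] = value
            if is_final:
                finals.add(name)
    return values, warnings


# Site file loaded first, with the property marked final.
site = ("mapred-site.xml", [("mapred.job.tracker", "hdfs://localhost:9000", True)])
# Job configuration loaded later; "local" is the value that selects local mode.
job = ("job_local_0001.xml", [("mapred.job.tracker", "local", False)])

values, warnings = merge_resources([site, job])
for line in warnings:
    print(line)
print(values["mapred.job.tracker"])
```

In this model the site file's value wins, which matches the symptom described above: the job is started in local mode but keeps the job tracker address from the final property.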