Hangs while running a group command

Time: 2016-04-20 11:47:23

Tags: hadoop mapreduce apache-pig

I loaded a file containing about 6000 rows of data with the following commands:

A = load '/home/hduser/hdfsdrive/piginput/data/airlines.dat' using PigStorage(',') as (Airline_ID:int, Name:chararray, Alias:chararray, IATA:chararray, ICAO:chararray, Callsign:chararray, Country:chararray, Active:chararray);
B = foreach airline generate Country,Airline_ID;
C = group B by Country;
D = foreach C generate group,COUNT(B);

With the above code, I can execute the first three statements without any problem, but the fourth one runs for a very long time. I also tried

dump C;

and even this gets stuck at the same point. Here is the log:

2016-04-20 16:08:16,617 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2016-04-20 16:08:16,898 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2016-04-20 16:08:17,125 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2016-04-20 16:08:17,129 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1da9647b
2016-04-20 16:08:17,190 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=130652568, MaxSingleShuffleLimit=32663142
2016-04-20 16:08:17,195 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Thread started: Thread for merging on-disk files
2016-04-20 16:08:17,195 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Thread started: Thread for merging in memory files
2016-04-20 16:08:17,195 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Thread waiting: Thread for merging on-disk files
2016-04-20 16:08:17,196 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 0 is already in progress
2016-04-20 16:08:17,196 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Thread started: Thread for polling Map Completion Events
2016-04-20 16:08:17,196 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:08:22,197 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:09:18,202 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2016-04-20 16:09:18,203 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:10:18,208 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2016-04-20 16:10:18,208 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:11:18,214 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2016-04-20 16:11:18,214 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:11:22,395 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 copy failed: attempt_201604201138_0003_m_000000_0 from ubuntu
2016-04-20 16:11:22,396 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:998)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:934)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:852)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1636)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1593)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1493)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1401)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1333)
2016-04-20 16:11:22,398 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201604201138_0003_r_000000_0: Failed fetch #1 from attempt_201604201138_0003_m_000000_0
2016-04-20 16:11:22,398 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 adding host ubuntu to penalty box, next contact in 12 seconds
2016-04-20 16:11:22,398 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0: Got 1 map-outputs from previous failures
2016-04-20 16:11:37,399 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:12:19,403 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2016-04-20 16:12:19,403 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)

I even stopped all the jobs and tried restarting them, but it was no use. My environment is Ubuntu / Hadoop 1.2.1 / Pig 0.15.0.

Please help.

Thanks, Sathish

2 Answers:

Answer 0 (score: 1)

I solved this issue. The problem was an incorrect IP address configured in /etc/hosts. I updated it to the IP address actually assigned to the Ubuntu machine and restarted the Hadoop services. The mismatch was visible in hadoop-hduser-jobtracker-ubuntu.log.


And hadoop-hduser-datanode-ubuntu.log showed the following:

STARTUP_MSG:   host = ubuntu/10.1.0.249

Based on these entries, I was able to trace the problem to the IP address, fix it in the /etc/hosts file, and restart the servers. After this, all Hadoop jobs ran without any issue, and I could load the data and run the Pig scripts.
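For reference, a minimal sketch of the kind of /etc/hosts fix described above, assuming the machine's hostname is ubuntu and its real address is the 10.1.0.249 reported in the datanode log (on Ubuntu, a default 127.0.1.1 entry for the hostname is a common cause of this kind of mismatch):

127.0.0.1    localhost
10.1.0.249   ubuntu

# restart the Hadoop 1.x daemons so they pick up the new mapping
$ stop-all.sh
$ start-all.sh

The hostname the daemons advertise (here, ubuntu) must resolve to the machine's routable IP; if it resolves to a loopback or stale address, the reducer cannot reach the TaskTracker's HTTP server to fetch map output, which is exactly the SocketTimeoutException seen in the log above.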

Thanks, Sathish.

Answer 1 (score: 0)

You are loading the data into relation A, but generating B from a relation called airline:

B = foreach airline generate Country,Airline_ID;

This should be

B = foreach A generate Country,Airline_ID;

Also, if you are counting the number of airlines per country, you need to modify relation D to

D = foreach C generate group as Country,COUNT(B.Airline_ID);
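Putting both fixes together, a minimal sketch of the corrected script (the path and schema are taken from the question; the dump at the end is just for illustration):

A = load '/home/hduser/hdfsdrive/piginput/data/airlines.dat' using PigStorage(',') as (Airline_ID:int, Name:chararray, Alias:chararray, IATA:chararray, ICAO:chararray, Callsign:chararray, Country:chararray, Active:chararray);
B = foreach A generate Country, Airline_ID;  -- project from A, not airline
C = group B by Country;
D = foreach C generate group as Country, COUNT(B.Airline_ID) as Airline_Count;
dump D;

Note that COUNT ignores tuples whose first field is null, so COUNT(B.Airline_ID) only counts rows with a non-null Airline_ID; use COUNT_STAR(B) instead if every row should be counted regardless of nulls.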