Question

I just wrote my first Pig script, and it doesn't seem to make any progress. Some background:

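I'm running CDH 4.5 on a CentOS 6.4 VM, all installed from Cloudera's yum repo. It is configured to run everything in pseudo-distributed mode. Everything runs as a service and appears to be configured correctly (thank goodness!).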

Here is my Pig script:

    A = LOAD '/user/msknapp/county_insurance_pp.txt' AS (fips:int,st:chararray,stfips:int,name:chararray,a:int,b:int,c:int,d:int,e:int,f:int,g:int);
    DUMP A;

The input file comes from data.gov; it is some insurance data. I preprocessed it, and here is some useful information:

    [msknapp@localhost data]$ cat county_insurance_pp.txt | grep BUTLER
    1013    AL  1   BUTLER  54480   129         3287        57895
    19023   IA  19  BUTLER  27291   29659           3386    25150   85486
    20015   KS  20  BUTLER  233855  10028       456 29278   5759    279376
    21031   KY  21  BUTLER  4164                453     4617
    29023   MO  29  BUTLER  48240   5217        738 2042    25081   81317
    31023   NE  31  BUTLER  4406            153 609     5168
    39017   OH  39  BUTLER  856205  103041  3854    38648   203328  19832   1224910
    42019   PA  42  BUTLER  1072941 19131   190 60648   68692   50230   1271832
    [msknapp@localhost data]$ hadoop fs -cat /user/msknapp/county_insurance_pp.txt | head
    1001    AL  1   AUTAUGA 215624  37156   46  130 53237   140420  446614
    1003    AL  1   BALDWIN 1060297 95925   3284    31096   99241   200581  1490424
    1005    AL  1   BARBOUR 37893   132     246 811     39082
    1007    AL  1   BIBB    3127    70      241 34403       37841
    1009    AL  1   BLOUNT  32311       135 11884   19392   4200    67922
    1011    AL  1   BULLOCK 4301    336     274 186     5098
    1013    AL  1   BUTLER  54480   129         3287        57895
    1015    AL  1   CALHOUN 469959  92702   5373    2130    17069   532033  1119265
    1017    AL  1   CHAMBERS    37238   3189        292 1953        42672
    1019    AL  1   CHEROKEE    37984   190 117 1081    1277        40649
    cat: Unable to write to output stream.

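When I run the Pig script from the command line, I get a whole bunch of log statements and it looks like it is running, but once it starts it never makes any progress, no matter how long I wait. Here are the last few lines: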

    2014-01-05 15:10:41,113 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1388936205793_0006
    2014-01-05 15:10:41,511 [JobControl] INFO  org.apache.hadoop.yarn.client.YarnClientImpl - Submitted application application_1388936205793_0006 to ResourceManager at /0.0.0.0:8032
    2014-01-05 15:10:41,564 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8088/proxy/application_1388936205793_0006/
    2014-01-05 15:10:41,653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete

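I modified the Pig script to point at a file on my local file system and ran it in local mode, and the job completed successfully within a few seconds. The local copy of the file is identical to the one in HDFS. I think that, for some reason, Pig cannot establish a solid connection to my HDFS.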

Can someone please tell me what I'm doing wrong?

Answer 1

Maybe try:

    A = LOAD '/user/msknapp/county_insurance_pp.txt' USING PigStorage('\t') AS (fips:int,st:chararray,stfips:int,name:chararray,a:int,b:int,c:int,d:int,e:int,f:int,g:int);
    DUMP A;
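
If you do go with an explicit tab delimiter, it may be worth checking that every row really has 11 tab-separated fields, since the schema declares 11 columns and some rows have empty values. Below is a minimal sketch of such a check. The two sample rows and the temporary file are made up for illustration (they only mirror the layout of county_insurance_pp.txt), and this check by itself is not claimed to explain a job stuck at 0%.

```shell
# Hypothetical sample: two rows laid out like county_insurance_pp.txt,
# written with explicit tabs; the second row has three empty fields.
sample="$(mktemp)"
printf '1001\tAL\t1\tAUTAUGA\t215624\t37156\t46\t130\t53237\t140420\t446614\n' > "$sample"
printf '1013\tAL\t1\tBUTLER\t54480\t129\t\t\t3287\t\t57895\n' >> "$sample"

# Tally how many lines have each tab-separated field count.
# With -F'\t', awk keeps empty fields, so both rows should count as 11.
field_counts="$(awk -F'\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$sample")"
echo "$field_counts"

rm -f "$sample"
```

To run the same tally against the real file, the awk command could be fed from `hadoop fs -cat /user/msknapp/county_insurance_pp.txt` instead of the sample file. Any count other than 11 would point to rows that do not line up with the declared schema.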
