Pig join returns no results

Time: 2013-05-03 01:02:15

Tags: hadoop amazon-web-services nosql apache-pig elastic-map-reduce

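I've been stuck on this problem for over 12 hours now. I have a Pig script running on Amazon Web Services. For now, I'm just running the script in interactive mode. I'm trying to compute averages over a large set of climate data from weather stations; however, the data has no country or state information, so it has to be joined with another table.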

States table:

719990 99999 LILLOOET                      CN CA BC WKF   +50683 -121933 +02780
719994 99999 SEDCO 710                     CN CA    CWQJ  +46500 -048500 +00000
720000 99999 BOGUS AMERICAN                US US          -99999 -999999 -99999
720001 99999 PEASON RIDGE/RANGE            US US LA K02R  +31400 -093283 +01410
720002 99999 HALLOCK(AWS)                  US US MN K03Y  +48783 -096950 +02500
720003 99999 DEER PARK(AWS)                US US WA K07S  +47967 -117433 +06720
720004 99999 MASON                         US US MI K09G  +42567 -084417 +02800
720005 99999 GASTONIA                      US US NC K0A6  +35200 -081150 +02440

Climate table: (I realize this sample doesn't contain anything that satisfies the join condition, but the full data set does.)

STN--- WBAN   YEARMODA    TEMP       DEWP      SLP        STP       VISIB      WDSP     MXSPD   GUST    MAX     MIN   PRCP   SNDP   FRSHTT
010010 99999  20090101    23.3 24    15.6 24  1033.2 24  1032.0 24   13.5  6    9.6 24   17.5  999.9    27.9*   16.7   0.00G 999.9  001000
010010 99999  20090102    27.3 24    20.5 24  1026.1 24  1024.9 24   13.7  5   14.6 24   23.3  999.9    28.9    25.3*  0.00G 999.9  001000
010010 99999  20090103    25.2 24    18.4 24  1028.3 24  1027.1 24   15.5  6    4.2 24    9.7  999.9    26.2*   23.9*  0.00G 999.9  001000
010010 99999  20090104    27.7 24    23.2 24  1019.3 24  1018.1 24    6.7  6    8.6 24   13.6  999.9    29.8    24.8   0.00G 999.9  011000
010010 99999  20090105    19.3 24    13.0 24  1015.5 24  1014.3 24    5.6  6   17.5 24   25.3  999.9    26.2*   10.2*  0.05G 999.9  001000
010010 99999  20090106    12.9 24     2.9 24  1019.6 24  1018.3 24    8.2  6   15.5 24   25.3  999.9    19.0*    8.8   0.02G 999.9  001000
010010 99999  20090107    26.2 23    20.7 23   998.6 23   997.4 23    6.6  6   12.1 22   21.4  999.9    31.5    19.2*  0.00G 999.9  011000
010010 99999  20090108    21.5 24    15.2 24   995.3 24   994.1 24   12.4  5   12.8 24   25.3  999.9    24.6*   19.2*  0.05G 999.9  011000
010010 99999  20090109    27.5 23    24.5 23   982.5 23   981.3 23    7.9  5   20.2 22   33.0  999.9    34.2    20.1*  0.00G 999.9  011000
010010 99999  20090110    22.5 23    16.7 23   977.2 23   976.1 23   11.9  6   15.5 23   35.0  999.9    28.9*   17.2   0.09G 999.9  000000

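I load the climate data with TextLoader, apply a regex to extract the fields, and filter nulls out of the result set. I then do the same with the states data, except that I filter it down to country US.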

The bags have the following schemas:
    CLIMATE_REMOVE_EMPTY:{station:int,wban:int,year:int,month:int,day:int,temp:double}
    STATES_FILTER_US:{station:int,wban:int,name:chararray,wmo:chararray,fips:chararray,state:chararray}

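I need to perform a join on (station, wban) so that I get a result bag with station, wban, year, month, and temp. When I DUMP the resulting bag, it reports success; however, the dump returns 0 results. Here is the output.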

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
1.0.3   0.9.2-amzn      hadoop  2013-05-03 00:10:51     2013-05-03 00:12:42         HASH_JOIN,FILTER

Success!

Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime          MaxReduceTime   MinReduceTime   AvgReduceTime   Alias   Feature Outputs
job_201305030005_0001   2       1       36      15      25      33      33      33              CLIMATE,CLIMATE_REMOVE_NULL,RAW_CLIMATE,RAW_STATES,STATES,STATES_FILTER_US,STATE_CLIMATE_JOIN   HASH_JOIN       hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203,

Input(s):
Successfully read 30587 records from: "hiddenbucket"
Successfully read 21027 records from: "hiddenbucket"

Output(s):
Successfully stored 0 records in: "hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

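I have no idea why this bag contains 0 results. My data extraction appears to be correct and the job succeeded. That leads me to believe the join condition is never satisfied. I know the input files contain some data that should satisfy the join condition, but nothing is returned.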
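One way to test the suspicion that the join keys never meet is to compare the key sets of both sides outside Pig. The following is only a minimal Python sketch, not part of the original script: `key_set` is a hypothetical helper, the rows are copied from the samples above with leading zeros dropped (as an int cast would do), and the extra climate row appended at the end is invented purely to show what a non-empty overlap looks like.

```python
def key_set(rows):
    """Collect the (station, wban) pairs from an iterable of parsed rows."""
    return {(station, wban) for station, wban, *_ in rows}


# Keys taken from the samples in the question.
climate_rows = [
    (10010, 99999, 2009, 1, 1, 23.3),
    (10010, 99999, 2009, 1, 2, 27.3),
]
state_rows = [
    (720002, 99999, "HALLOCK(AWS)", "US", "US", "MN"),
    (720004, 99999, "MASON", "US", "US", "MI"),
]

# An empty intersection means an inner join on (station, wban) yields nothing.
shared = key_set(climate_rows) & key_set(state_rows)
print("keys present on both sides:", sorted(shared))

# Hypothetical extra climate row for a US station (made-up values).
climate_rows.append((720004, 99999, 2009, 1, 1, 30.0))
shared_after = key_set(climate_rows) & key_set(state_rows)
print("after adding a matching row:", sorted(shared_after))
```

If the overlap is empty on the real data as well, the problem lies in the keys themselves (for example, rows that were never parsed) rather than in the JOIN statement.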

The only thing that looks suspicious is a warning that says: Encountered Warning ACCESSING_NON_EXISTENT_FIELD 26001 time(s).

I'm not sure where to go from here. Since the job didn't fail, I don't see any errors or anything else to debug.

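I'm not sure whether these mean anything, but here are some other things that stand out: When I try to ILLUSTRATE STATE_CLIMATE_JOIN, I get a NullPointerException - ERROR 2997: Encountered IOException. Exception : null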

When I try to ILLUSTRATE STATES, I get java.lang.IndexOutOfBoundsException: Index: 1, Size: 1

Here is my full code:

--Piggy Bank Functions
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();

--Load Climate Data
RAW_CLIMATE = LOAD 'hiddenbucket' USING TextLoader as (line:chararray);
RAW_STATES= LOAD 'hiddenbucket' USING TextLoader as (line:chararray);

CLIMATE= 
  FOREACH 
    RAW_CLIMATE
  GENERATE   
    FLATTEN ((tuple(int,int,int,int,int,double))
      EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d{1,3}\\.\\d{1})')
    ) 
    AS (
      station: int,
      wban: int,
      year: int,
      month: int,
      day: int,
      temp: double
    )
  ;

STATES= 
  FOREACH 
    RAW_STATES
  GENERATE   
    FLATTEN ((tuple(int,int,chararray,chararray,chararray,chararray))
      EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\S+)\\s+(\\w{2})\\s+(\\w{2})\\s+(\\w{2})')
    ) 
    AS (
      station: int,
      wban: int,
      name: chararray,
      wmo: chararray,
      fips: chararray,
      state: chararray
    )
  ;

CLIMATE_REMOVE_NULL = FILTER CLIMATE BY station IS NOT NULL;
STATES_FILTER_US = FILTER STATES BY (fips == 'US');
STATE_CLIMATE_JOIN = JOIN CLIMATE_REMOVE_NULL BY (station), STATES_FILTER_US BY (station);

Thanks in advance. I'm at a loss here.

--EDIT-- I finally got it working! The regex I used to parse STATE_DATA was invalid.
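The edit does not say what exactly was wrong with the regex, so the following is only a guess at one plausible failure. The Python sketch below checks the STATES pattern as posted against rows shaped like the sample: `\S+` cannot span a station name that contains spaces, and `\s+(\w{2})` cannot capture a blank state. Everything here is an assumption for illustration: `make_row` rebuilds rows in a fixed-width layout modeled on the sample rather than reading the real file, `RELAXED` is a hypothetical replacement pattern rather than the author's actual fix, and Python's `re` is assumed to treat these constructs the same way Java's regex engine does.

```python
import re

# The STATES pattern from the question, with Pig's doubled backslashes undone.
ORIGINAL = re.compile(r'^(\d{6})\s+(\d{5})\s+(\S+)\s+(\w{2})\s+(\w{2})\s+(\w{2})')

# Hypothetical replacement: lazy name that may contain spaces, state may be blank.
RELAXED = re.compile(r'^(\d{6})\s+(\d{5})\s+(\S.*?)\s+(\w{2}) (\w{2}) (\w{2}|  ) ')


def make_row(station, wban, name, wmo, fips, state, call):
    """Build a row in an assumed fixed-width layout modeled on the sample."""
    return (f"{station} {wban} {name:<29} {wmo} {fips} "
            f"{state:<2} {call:<4}  +00000 -000000 +00000")


ROWS = [
    make_row("719990", "99999", "LILLOOET", "CN", "CA", "BC", "WKF"),
    make_row("719994", "99999", "SEDCO 710", "CN", "CA", "", "CWQJ"),
    make_row("720000", "99999", "BOGUS AMERICAN", "US", "US", "", ""),
    make_row("720003", "99999", "DEER PARK(AWS)", "US", "US", "WA", "K07S"),
    make_row("720004", "99999", "MASON", "US", "US", "MI", "K09G"),
]


def parse(pattern, line):
    """Return the captured groups, or None when the pattern does not match."""
    m = pattern.match(line)
    return m.groups() if m else None


original_hits = [parse(ORIGINAL, r) for r in ROWS]
relaxed_hits = [parse(RELAXED, r) for r in ROWS]

for row, a, b in zip(ROWS, original_hits, relaxed_hits):
    print(row[:13].strip(), "original:", a is not None, "relaxed:", b is not None)
```

Under these assumptions, the original pattern only matches rows whose station name is a single token, so any station with a multi-word name would drop out before the FILTER and the JOIN. Whether that explains the empty result depends on the real input files.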

0 answers:

No answers