PIG (v0.10.0) exception during a FILTER operation: java.lang.Integer cannot be cast to java.lang.String

Asked: 2013-03-13 19:03:36

Tags: exception hadoop mapreduce apache-pig

Here is my (seemingly trivial) PIG script, followed by the exception it generates:

raw_logs = LOAD './Apache-WebLog-Samples.d/access_log.txt' USING TextLoader() AS (line:chararray);

logs = FOREACH raw_logs GENERATE FLATTEN (
    REGEX_EXTRACT_ALL(line, '^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+\\[([\\w:/]+\\s[+\\-]\\d{4})\\]\\s+"(..*)"\\s+(\\S+)\\s+(\\S+)'))
       AS (remoteAddr:    chararray,
           remoteLogname: chararray,
           user:          chararray,
           date_time:     chararray, 
           request:       chararray,
           httpStatus:          int,   -- <- Here's the problem. It goes away when I set this to chararray.
           numBytes:            int);

httpGET200 = FILTER logs BY (request MATCHES '^GET\\s.*') AND (httpStatus == 200);

mylimit = LIMIT httpGET200 40;

DUMP mylimit;

PIG SCRIPT

java.lang.Exception: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String

[ ... non meaningful error output removed ... ]

2013-03-13 14:04:10,882 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
2.0.0-cdh4.2.0  0.10.0-cdh4.2.0 nmvega  2013-03-13 14:04:05 2013-03-13 14:04:10 FILTER,LIMIT

Failed!    
Failed Jobs:
JobId   Alias   Feature Message Outputs
job_local1982169921_0001    httpGET200,logs,mylimit,raw_logs        Message: Job failed!    

Input(s):
Failed to read data from "file:///home/user/Dropbox/CodeDEV.d/BIG-DATA-SNIPPETS.d/PIG.d/Apache-WebLog-Samples.d/access_log.txt"

Output(s):

EXCEPTION MESSAGE

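Everything works except the 'httpGET200' relation. For reasons I don't understand, the clause "httpStatus == 200" causes the exception above. When I remove that clause, the problem goes away. Alternatively, when I change the schema and declare 'httpStatus' as type "chararray" instead of "int" (int being what I annotated above and the appropriate type for an HTTP status code), the problem also goes away. (Of course, when I do that, I have to edit the relation to add quotes, like so: httpStatus == '200'.)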

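I inspected the input data file and verified that, on every line, the field corresponding to 'httpStatus' is indeed always an integer (well, a substring that represents an integer).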
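One way to run that kind of check inside Pig itself is sketched below. It assumes every extracted field is declared as chararray so that the string test is valid; the aliases `fields` and `suspect` are hypothetical names used only for this illustration, and the path and regular expression are copied from the script above.

```pig
-- Sketch only: list the rows whose status field is missing or not all digits.
raw_logs = LOAD './Apache-WebLog-Samples.d/access_log.txt' USING TextLoader() AS (line:chararray);

-- Every field is declared chararray here so that MATCHES can be applied to httpStatus.
fields = FOREACH raw_logs GENERATE FLATTEN (
    REGEX_EXTRACT_ALL(line, '^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+\\[([\\w:/]+\\s[+\\-]\\d{4})\\]\\s+"(..*)"\\s+(\\S+)\\s+(\\S+)'))
       AS (remoteAddr:    chararray,
           remoteLogname: chararray,
           user:          chararray,
           date_time:     chararray,
           request:       chararray,
           httpStatus:    chararray,
           numBytes:      chararray);

-- Keep rows where the status is null (the line did not match the regex) or is not purely numeric.
suspect = FILTER fields BY (httpStatus IS NULL) OR NOT (httpStatus MATCHES '\\d+');

DUMP suspect;
```

If this prints nothing, every line matched the pattern and carried a numeric status.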

By the way, this is the schema that grunt reports (i.e. exactly what I expect):

grunt> describe httpGET200;
httpGET200: {remoteAddr: chararray,remoteLogname: chararray,user: chararray,date_time: chararray,request: chararray,httpStatus: int,numBytes: int}

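I'd like to understand what is happening here (whether it's a misunderstanding on my part or a PIG limitation). Can anyone shed some light?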

Thanks!

2 answers:

Answer 0: (score: 5)

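It seems to me that if a field in the output schema of REGEX_EXTRACT_ALL is declared as int, performing arithmetic operations on that field results in a ClassCastException. This is probably because, despite the given schema, all fields inside the returned tuple remain chararray and are treated as such.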

As a workaround, you can declare all the fields as chararray and then perform an explicit cast:

logs = FOREACH raw_logs ....
conv = FOREACH logs generate remoteAddr, remoteLogname, user, date_time, 
         request, (int)httpStatus, (int)numBytes;

Then you can apply the filter you originally used:

httpGET200 = FILTER conv BY (request MATCHES '^GET\\s.*') AND (httpStatus == 200);
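Put together, the workaround might look like the sketch below. The path, regular expression, and relation names come from the question; the `AS httpStatus` and `AS numBytes` clauses on the casts are an addition made here so that the converted columns carry explicit names, not something the steps above require.

```pig
-- Sketch of the full workaround, assembled from the question's script.
raw_logs = LOAD './Apache-WebLog-Samples.d/access_log.txt' USING TextLoader() AS (line:chararray);

-- Declare every extracted field as chararray.
logs = FOREACH raw_logs GENERATE FLATTEN (
    REGEX_EXTRACT_ALL(line, '^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+\\[([\\w:/]+\\s[+\\-]\\d{4})\\]\\s+"(..*)"\\s+(\\S+)\\s+(\\S+)'))
       AS (remoteAddr:    chararray,
           remoteLogname: chararray,
           user:          chararray,
           date_time:     chararray,
           request:       chararray,
           httpStatus:    chararray,
           numBytes:      chararray);

-- Cast the numeric fields explicitly and name the results.
conv = FOREACH logs GENERATE remoteAddr, remoteLogname, user, date_time, request,
         (int)httpStatus AS httpStatus, (int)numBytes AS numBytes;

-- The original filter now compares an int column with an int constant.
httpGET200 = FILTER conv BY (request MATCHES '^GET\\s.*') AND (httpStatus == 200);

mylimit = LIMIT httpGET200 40;

DUMP mylimit;
```

Note that a value that cannot be parsed as a number, such as the "-" that Apache access logs commonly write for an empty byte count, becomes null after the cast rather than failing the job.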

You can find more information about a similar issue in this ticket:

Answer 1: (score: 1)

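I ran into the same problem when trying to compare two integers in a FILTER statement in a Pig script. The most elegant solution I found is to use GenericInvoker. So for your problem I would use: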

-- StringToInt is a function that invokes the valueOf method of the Integer class on a String argument.
DEFINE StringToInt InvokeForInt('java.lang.Integer.valueOf', 'String');

-- Now we can use it in the FILTER statement (no extra projection is needed to get correctly typed tuples).
httpGET200 = FILTER logs BY (request MATCHES '^GET\\s.*') AND (StringToInt(httpStatus) == 200);

Voilà!